Breast cancer diagnosis using SparkR
===

* *30 min* | Last modified: June 22, 2019

This document illustrates how to build classification models with SparkR. The tutorial focuses on the use of the language and assumes the reader is proficient in using and interpreting classification algorithms.

## Problem definition

The goal is to determine whether a breast mass is a benign or malignant tumor, based on measurements taken from digitized images of a fine-needle aspirate. The values describe the characteristics of the cell nuclei present in the digital image.

The sample contains 569 biopsy results. Each record has 32 variables: an identification number, the diagnosis, and 30 measurements, which correspond to three statistics (mean, standard error as indicated by the `_se` suffix, and worst case) of ten different characteristics (radius, texture, ...). The fields are:

* Identification number
* Cancer diagnosis ("M" for malignant and "B" for benign)
* Radius
* Texture
* Perimeter
* Area
* Smoothness
* Compactness
* Concavity
* Concave points
* Symmetry
* Fractal dimension

In terms of the data, the goal is to predict whether a mass is benign or malignant (class B or M) from the 30 measurement variables.

Data source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

## Solution

In [1]:
##
## This helper runs a command in the operating system
## and prints the captured output, one line per row.
##
systemp <- function(command) cat(system(command, intern = TRUE), sep = '\n')

In [2]:
##
## Load the library and start a Spark session
##
library(SparkR)
sparkR.session(enableHiveSupport = FALSE)


Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union

Spark package found in SPARK_HOME: /usr/local/spark


Launching java with spark-submit command /usr/local/spark/bin/spark-submit   sparkr-shell /tmp/RtmpCgVAK3/backend_port12927997ee89 


Java ref type org.apache.spark.sql.SparkSession id 1 

### Exploration

The file is copied from the local machine to HDFS.

In [3]:
## Copy the file to HDFS
systemp('hdfs dfs -copyFromLocal wisc_bc_data.csv /tmp/wisc_bc_data.csv') 




In [4]:
df <- read.df(
    '/tmp/wisc_bc_data.csv',  # file location and name
    'csv',                    # format
    header = TRUE)            # the first row holds the column names

head(df)

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,...,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
843786,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,...,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244


In [5]:
##
## Print the schema in tree format
##
printSchema(df)

root
 |-- id: string (nullable = true)
 |-- diagnosis: string (nullable = true)
 |-- radius_mean: string (nullable = true)
 |-- texture_mean: string (nullable = true)
 |-- perimeter_mean: string (nullable = true)
 |-- area_mean: string (nullable = true)
 |-- smoothness_mean: string (nullable = true)
 |-- compactness_mean: string (nullable = true)
 |-- concavity_mean: string (nullable = true)
 |-- concave_points_mean: string (nullable = true)
 |-- symmetry_mean: string (nullable = true)
 |-- fractal_dimension_mean: string (nullable = true)
 |-- radius_se: string (nullable = true)
 |-- texture_se: string (nullable = true)
 |-- perimeter_se: string (nullable = true)
 |-- area_se: string (nullable = true)
 |-- smoothness_se: string (nullable = true)
 |-- compactness_se: string (nullable = true)
 |-- concavity_se: string (nullable = true)
 |-- concave_points_se: string (nullable = true)
 |-- symmetry_se: string (nullable = true)
 |-- fractal_dimension_se: string (nullable = true)
 |-- radiu
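
The schema shows that every column was read as a string, because `read.df` was called without schema inference. Spark's R formula support treats string columns as categorical, so the measurements would probably need to be numeric before a model is fitted. The following is a minimal sketch of one way to convert them; the helper name `cast_to_double` is hypothetical, and it assumes the SparkR package is installed and a session is active when it is called on a SparkDataFrame.

```r
## Hypothetical helper: cast the given columns of a SparkDataFrame to double.
## SparkR functions are referenced with `SparkR::` so the definition itself
## does not require the package to be attached.
cast_to_double <- function(sdf, col_names) {
    for (name in col_names) {
        ## withColumn() replaces the column when the name already exists
        sdf <- SparkR::withColumn(sdf, name, SparkR::cast(sdf[[name]], "double"))
    }
    sdf
}

## Usage (not run here; needs the `df` loaded above):
## every column except id and diagnosis is a measurement.
## feature_cols <- setdiff(SparkR::columns(df), c("id", "diagnosis"))
## df_num <- cast_to_double(df, feature_cols)
## printSchema(df_num)
```

An alternative is to pass `inferSchema = "true"` to `read.df`, which asks the CSV reader to infer the column types at load time.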

### Logistic regression

In [6]:
##
## The model is specified with the usual R formula syntax
##
model <- spark.logit(
    df, 
    diagnosis ~ radius_mean + texture_mean + perimeter_mean + area_mean + 
                smoothness_mean + compactness_mean + concavity_mean + 
                concave_points_mean + symmetry_mean + 
                fractal_dimension_mean + radius_se + texture_se + 
                perimeter_se + area_se + smoothness_se + compactness_se + 
                concavity_se + concave_points_se + symmetry_se + 
                fractal_dimension_se + radius_worst + texture_worst + 
                perimeter_worst + area_worst + smoothness_worst + 
                compactness_worst + concavity_worst + 
                concave_points_worst + symmetry_worst + 
                fractal_dimension_worst,
    maxIter = 100, 
    regParam = 0.0, 
    elasticNetParam = 0.0)

# Prediction
fitted_logit <- predict(model, df)
head(fitted_logit)

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst,rawPrediction,probability,prediction
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,...,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<list>,<list>,<chr>
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,<environment: 0x55b20360b468>,<environment: 0x55b20363d6b8>,M
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,<environment: 0x55b203618eb8>,<environment: 0x55b203640570>,M
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,<environment: 0x55b20361f7b0>,<environment: 0x55b203646fb8>,M
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,<environment: 0x55b2036229b0>,<environment: 0x55b20364cc70>,M
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,<environment: 0x55b20362f698>,<environment: 0x55b203653c08>,M
843786,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,...,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,<environment: 0x55b203632940>,<environment: 0x55b203657370>,M
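
The model formula lists all 30 measurement columns by hand. As a sketch of a less error-prone alternative, the formula can be built programmatically in base R. The variable names `characteristics`, `statistics`, `feature_cols`, and `model_formula` are illustrative, and the column names are assumed to follow the `<characteristic>_<statistic>` pattern seen in the formula above.

```r
## Ten characteristics and three statistics, as named in the formula above
characteristics <- c("radius", "texture", "perimeter", "area", "smoothness",
                     "compactness", "concavity", "concave_points", "symmetry",
                     "fractal_dimension")
statistics <- c("mean", "se", "worst")

## outer() pairs every characteristic with every statistic: 10 x 3 = 30 names
feature_cols <- as.vector(outer(characteristics, statistics, paste, sep = "_"))

## reformulate() builds diagnosis ~ radius_mean + texture_mean + ...
model_formula <- reformulate(feature_cols, response = "diagnosis")
print(model_formula)
```

The resulting formula object could then be passed in place of the hand-written one, for example `spark.logit(df, model_formula, maxIter = 100)`. In a live session the feature names could also be derived from `columns(df)` by dropping `id` and `diagnosis`.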


### Random forest classifier

In [7]:
##
## The model is specified with the usual R formula syntax
##
model <- spark.randomForest(
    df, 
    diagnosis ~ radius_mean + texture_mean + perimeter_mean + area_mean + 
                smoothness_mean + compactness_mean + concavity_mean + 
                concave_points_mean + symmetry_mean + 
                fractal_dimension_mean + radius_se + texture_se + 
                perimeter_se + area_se + smoothness_se + compactness_se + 
                concavity_se + concave_points_se + symmetry_se + 
                fractal_dimension_se + radius_worst + texture_worst + 
                perimeter_worst + area_worst + smoothness_worst + 
                compactness_worst + concavity_worst + 
                concave_points_worst + symmetry_worst + 
                fractal_dimension_worst,    
    "classification",
    numTrees = 10)

# Prediction
fitted_randomForest <- predict(model, df)
head(fitted_randomForest)

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst,rawPrediction,probability,prediction
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,...,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<list>,<list>,<chr>
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,<environment: 0x55b200529670>,<environment: 0x55b2004990a0>,B
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,<environment: 0x55b2005145f8>,<environment: 0x55b2004910d8>,B
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,<environment: 0x55b200503da8>,<environment: 0x55b1ff5be858>,B
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,<environment: 0x55b2004fc518>,<environment: 0x55b1ff635c28>,B
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,<environment: 0x55b2004f04e0>,<environment: 0x55b1ff630558>,B
843786,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,...,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,<environment: 0x55b2004b5ae8>,<environment: 0x55b1ff119578>,B


### Gradient-boosted tree classifier

In [8]:
model <- spark.gbt(
    df, 
    diagnosis ~ radius_mean + texture_mean + perimeter_mean + area_mean + 
                smoothness_mean + compactness_mean + concavity_mean + 
                concave_points_mean + symmetry_mean + 
                fractal_dimension_mean + radius_se + texture_se + 
                perimeter_se + area_se + smoothness_se + compactness_se + 
                concavity_se + concave_points_se + symmetry_se + 
                fractal_dimension_se + radius_worst + texture_worst + 
                perimeter_worst + area_worst + smoothness_worst + 
                compactness_worst + concavity_worst + 
                concave_points_worst + symmetry_worst + 
                fractal_dimension_worst,    
    "classification",
    maxIter = 50)

# Prediction
fitted_gbt <- predict(model, df)
head(fitted_gbt)

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst,rawPrediction,probability,prediction
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,...,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<list>,<list>,<chr>
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,<environment: 0x55b201eee528>,<environment: 0x55b201f1c618>,M
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,<environment: 0x55b201efded0>,<environment: 0x55b201f29ea8>,M
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,<environment: 0x55b201f03060>,<environment: 0x55b201f2ffc0>,M
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,<environment: 0x55b201f09530>,<environment: 0x55b201f38ab0>,B
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,<environment: 0x55b201f0eb90>,<environment: 0x55b201f3ef80>,M
843786,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,...,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,<environment: 0x55b201f16f48>,<environment: 0x55b201f41bd0>,M
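
The first six rows say little about how well each model classifies. One way to summarize the predictions is a confusion matrix and an accuracy figure computed on a local data frame. The sketch below uses a hypothetical helper, `evaluate_predictions`, and toy values that are not taken from the dataset; it assumes the columns are named `diagnosis` and `prediction`, as in the outputs above.

```r
## Hypothetical helper: summarize label/prediction pairs held in a local
## data.frame with character columns `diagnosis` and `prediction`.
evaluate_predictions <- function(local_df) {
    classes <- c("B", "M")
    ## Fixed factor levels keep the table 2 x 2 even if a class is never predicted
    actual <- factor(local_df$diagnosis, levels = classes)
    predicted <- factor(local_df$prediction, levels = classes)
    confusion <- table(actual = actual, predicted = predicted)
    list(confusion = confusion,
         accuracy = sum(diag(confusion)) / sum(confusion))
}

## Toy data for illustration only (not taken from the dataset)
toy <- data.frame(diagnosis  = c("M", "M", "B", "B", "M"),
                  prediction = c("M", "B", "B", "B", "M"),
                  stringsAsFactors = FALSE)
result <- evaluate_predictions(toy)
print(result$confusion)
print(result$accuracy)
```

With SparkR, the two columns could first be brought to the driver with something like `collect(select(fitted_gbt, "diagnosis", "prediction"))`. Because the models above are evaluated on the same data they were trained on, any accuracy obtained this way is likely to be optimistic.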
