# Analysis of the Iris flower dataset with an MLPC neural network model

![](https://www.saedsayad.com/images/Perceptron_bkp_1.png)

## Approach using k-Means clustering

* RED    ->   Setosa
* BLUE   ->   Versicolor
* GREEN  ->   Virginica

![](kmeans_fallido.png)

The plot shows that the centroids computed by k-Means are not very effective for this dataset: the resulting clusters are widely dispersed.
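To make the centroid computation concrete, the following is a minimal plain-Scala sketch of a single k-Means (Lloyd) iteration. It is not the code that produced the plot above; the toy points, the initial centroids and the helper names (`squaredDistance`, `nearest`, `step`) are assumptions made only for illustration. Note that the procedure never looks at the species labels, which is one plausible reason why its clusters need not match them.

```scala
// Plain-Scala sketch of one k-Means (Lloyd) iteration on toy data (not the Iris set).
def squaredDistance(a: Vector[Double], b: Vector[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Assignment step: index of the closest centroid.
def nearest(p: Vector[Double], centroids: Seq[Vector[Double]]): Int =
  centroids.indices.minBy(i => squaredDistance(p, centroids(i)))

// Update step: each centroid becomes the mean of the points assigned to it.
def step(points: Seq[Vector[Double]], centroids: Seq[Vector[Double]]): Seq[Vector[Double]] = {
  val groups = points.groupBy(p => nearest(p, centroids))
  centroids.indices.map { i =>
    groups.get(i) match {
      case Some(ps) => ps.transpose.map(col => col.sum / ps.size).toVector
      case None     => centroids(i) // empty cluster: keep the old centroid
    }
  }
}

// Hypothetical toy data: two well-separated groups of two points each.
val toyPoints = Seq(Vector(1.0, 1.0), Vector(1.0, 3.0), Vector(9.0, 9.0), Vector(9.0, 11.0))
val initialCentroids = Seq(Vector(0.0, 0.0), Vector(10.0, 10.0))
val updated = step(toyPoints, initialCentroids)
```

Calling `step` repeatedly until the centroids stop moving gives the usual algorithm.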

## Approach using MLPC

### Importing dependencies

In [36]:
import $ivy.`org.apache.spark::spark-sql:2.4.4`
import $ivy.`org.apache.spark::spark-mllib:2.4.4`
import org.apache.spark.sql._
import org.apache.log4j.{Level, Logger}
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types._
import org.apache.spark.ml.feature._

Logger.getLogger("org").setLevel(Level.OFF)

### Spark session in the notebook environment

In [37]:
val spark = {
  NotebookSparkSession.builder()
    .master("local[*]")
    .getOrCreate()
}

Getting spark JARs
Creating SparkSession

spark: SparkSession = org.apache.spark.sql.SparkSession@5f6165a4

### Schema for storing the input data

In [38]:
val schema = StructType(
  StructField("Longitud_Sepalo", DoubleType, nullable = false) ::
  StructField("Anchura_Sepalo", DoubleType, nullable = true) ::
  StructField("Longitud_Petalo", DoubleType, nullable = true) ::
  StructField("Anchura_Petalo", DoubleType, nullable = true) ::
  StructField("Label", StringType, nullable = true) :: Nil)

schema: StructType = StructType(
  StructField("Longitud_Sepalo", DoubleType, false, {}),
  StructField("Anchura_Sepalo", DoubleType, true, {}),
  StructField("Longitud_Petalo", DoubleType, true, {}),
  StructField("Anchura_Petalo", DoubleType, true, {}),
  StructField("Label", StringType, true, {})
)

### Loading the input data

In [39]:
// Read the CSV with the explicit schema; with DROPMALFORMED, rows that do not
// match the schema are dropped instead of failing the read.
val csvData = spark.read.options(Map(
    "header" -> "true",
    "ignoreLeadingWhiteSpace" -> "true",
    "ignoreTrailingWhiteSpace" -> "true",
    "timestampFormat" -> "MM/dd/yyyy HH:mm:ss a",
    "mode" -> "DROPMALFORMED"))
    .schema(schema)
    .csv("C:/sw/iris.csv")

csvData: DataFrame = [Longitud_Sepalo: double, Anchura_Sepalo: double ... 3 more fields]

### Sample of the loaded data

In [40]:
csvData.show(5)

+---------------+--------------+---------------+--------------+------+
|Longitud_Sepalo|Anchura_Sepalo|Longitud_Petalo|Anchura_Petalo| Label|
+---------------+--------------+---------------+--------------+------+
|            5.1|           3.5|            1.4|           0.2|setosa|
|            4.9|           3.0|            1.4|           0.2|setosa|
|            4.7|           3.2|            1.3|           0.2|setosa|
|            4.6|           3.1|            1.5|           0.2|setosa|
|            5.0|           3.6|            1.4|           0.2|setosa|
+---------------+--------------+---------------+--------------+------+
only showing top 5 rows



### Creating a column that groups the features into a vector

In [41]:
// Combine the four numeric columns into a single vector column named "Features".
val assembler = new VectorAssembler()
  .setInputCols(Array("Longitud_Sepalo", "Anchura_Sepalo", "Longitud_Petalo", "Anchura_Petalo"))
  .setOutputCol("Features")

val preFitData = assembler.transform(csvData)
preFitData.show(5)

+---------------+--------------+---------------+--------------+------+-----------------+
|Longitud_Sepalo|Anchura_Sepalo|Longitud_Petalo|Anchura_Petalo| Label|         Features|
+---------------+--------------+---------------+--------------+------+-----------------+
|            5.1|           3.5|            1.4|           0.2|setosa|[5.1,3.5,1.4,0.2]|
|            4.9|           3.0|            1.4|           0.2|setosa|[4.9,3.0,1.4,0.2]|
|            4.7|           3.2|            1.3|           0.2|setosa|[4.7,3.2,1.3,0.2]|
|            4.6|           3.1|            1.5|           0.2|setosa|[4.6,3.1,1.5,0.2]|
|            5.0|           3.6|            1.4|           0.2|setosa|[5.0,3.6,1.4,0.2]|
+---------------+--------------+---------------+--------------+------+-----------------+
only showing top 5 rows

assembler: VectorAssembler = vecAssembler_051b3d42c506
preFitData: DataFrame = [Longitud_Sepalo: double, Anchura_Sepalo: double ... 4 more fields]

In [42]:
// Map each species name in "Label" to a numeric index in "LabelIndexado".
val labelIndexer = new StringIndexer().setInputCol("Label").setOutputCol("LabelIndexado").fit(preFitData)
val etiquetas = labelIndexer.labels.mkString("[", ", ", "]")
println(s"Label types found: $etiquetas")

Label types found: [versicolor, virginica, setosa]

labelIndexer: StringIndexerModel = strIdx_05345d82852f
etiquetas: String = "[versicolor, virginica, setosa]"
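By default `StringIndexer` assigns index 0 to the most frequent label, index 1 to the next one, and so on. The sketch below reproduces that idea in plain Scala on a handful of made-up labels. The alphabetical tie-break and the helper `indexToLabel` are assumptions of this sketch, not a statement about how Spark resolves ties. If the three species are equally frequent, as in the standard 150-row Iris data, the printed order reflects tie-breaking rather than a real ranking.

```scala
// Plain-Scala sketch of frequency-based label indexing (toy labels, not the Iris data).
val toyLabels = Seq("setosa", "virginica", "virginica", "versicolor", "virginica", "setosa")

// Most frequent label first; the alphabetical tie-break is an assumption of this sketch.
val orderedLabels: Array[String] = toyLabels
  .groupBy(identity)
  .map { case (label, occurrences) => (label, occurrences.size) }
  .toSeq
  .sortBy { case (label, count) => (-count, label) }
  .map { case (label, _) => label }
  .toArray

// Label -> index, i.e. the value that would be stored in the indexed column.
val labelToIndex: Map[String, Double] =
  orderedLabels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

// Index -> label: the inverse lookup, using the same ordered array.
def indexToLabel(index: Double): String = orderedLabels(index.toInt)
```

The same ordered array is what `IndexToString` receives through `labelIndexer.labels` to turn predictions back into species names.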

In [43]:
val featureIndexer = new VectorIndexer().setInputCol("Features").setOutputCol("FeaturesIndexadas").setMaxCategories(4).fit(preFitData)

featureIndexer: VectorIndexerModel = vecIdx_42fdaa79e3dd
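`VectorIndexer` treats a feature as categorical when it has at most `maxCategories` distinct values and leaves it continuous otherwise. The following plain-Scala sketch applies that rule to a few made-up rows; the data and the names `toyRows`, `distinctPerFeature` and `categoricalFeatures` are hypothetical.

```scala
// Plain-Scala sketch of the VectorIndexer decision rule (toy rows, not the Iris data).
val maxCategories = 4

// Each row is a feature vector; column 1 only takes two distinct values.
val toyRows: Seq[Vector[Double]] = Seq(
  Vector(5.1, 0.0), Vector(4.9, 1.0), Vector(4.7, 0.0),
  Vector(4.6, 1.0), Vector(5.0, 0.0), Vector(5.4, 1.0)
)

// Number of distinct values observed in each feature column.
val distinctPerFeature: Seq[Int] = toyRows.transpose.map(_.distinct.size)

// Features with at most maxCategories distinct values are considered categorical.
val categoricalFeatures: Seq[Int] =
  distinctPerFeature.zipWithIndex.collect { case (n, i) if n <= maxCategories => i }
```

On the fitted model, `featureIndexer.categoryMaps` lists the features that were actually indexed. For continuous measurements such as the Iris columns, one would expect it to be empty, although that has not been checked here.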

### Splitting the records into training and test sets

In [44]:
val splits = preFitData.randomSplit(Array(0.6, 0.4))
val trainingData = splits(0)
val testData = splits(1)

splits: Array[Dataset[Row]] = Array(
  [Longitud_Sepalo: double, Anchura_Sepalo: double ... 4 more fields],
  [Longitud_Sepalo: double, Anchura_Sepalo: double ... 4 more fields]
)
trainingData: Dataset[Row] = [Longitud_Sepalo: double, Anchura_Sepalo: double ... 4 more fields]
testData: Dataset[Row] = [Longitud_Sepalo: double, Anchura_Sepalo: double ... 4 more fields]
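A random split assigns each row to one side independently, so the 60/40 proportion is only approximate, and without a fixed seed the two sets change on every run. The plain-Scala sketch below illustrates both points. The function `seededSplit`, the row ids and the seed value 42 are assumptions made for illustration and do not reproduce Spark's sampler.

```scala
import scala.util.Random

// Plain-Scala sketch of a seeded random split; each row is assigned independently.
def seededSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  // Draw one uniform number per row; rows below the threshold go to training.
  val tagged = rows.map(row => (row, rng.nextDouble() < trainFraction))
  val (train, test) = tagged.partition { case (_, isTrain) => isTrain }
  (train.map(_._1), test.map(_._1))
}

// Hypothetical row ids standing in for 150 records.
val rowIds = (1 to 150).toVector
val (trainIds, testIds) = seededSplit(rowIds, 0.6, seed = 42L)
val (trainAgain, testAgain) = seededSplit(rowIds, 0.6, seed = 42L)
```

In Spark the equivalent would be the overload that takes a seed, for example `preFitData.randomSplit(Array(0.6, 0.4), seed = 42L)`, where 42 is an arbitrary choice.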

### Representative structure of a multilayer neural network

![](https://dsc-spidal.github.io/harp/img/nn.png)

In [45]:
// Layer sizes of the network: 4 inputs (features), two hidden layers of 5 units, 3 outputs (classes)
val percentronsCapaEntrada = 4
val percentronsCapaIntermedia = 5
val percentronsCapaSalida = 3
val layers = Array[Int](percentronsCapaEntrada, percentronsCapaIntermedia, percentronsCapaIntermedia, percentronsCapaSalida)

percentronsCapaEntrada: Int = 4
percentronsCapaIntermedia: Int = 5
percentronsCapaSalida: Int = 3
layers: Array[Int] = Array(4, 5, 5, 3)
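To show what these layer sizes mean in practice, here is a minimal plain-Scala sketch of a forward pass through a 4-5-5-3 network, with sigmoid units in the hidden layers and a softmax output, which is the combination described in the Spark documentation for its multilayer perceptron. The weights are random placeholders rather than learned values, and all names (`layerSizes`, `parameters`, `affine`, `forward`) are hypothetical.

```scala
import scala.util.Random

// Plain-Scala sketch of a forward pass through a 4-5-5-3 network.
// Weights are random placeholders, NOT the values learned by Spark.
val layerSizes = Array(4, 5, 5, 3)
val rng = new Random(0L)

// One (weights, bias) pair per connection between consecutive layers.
val parameters: Array[(Array[Array[Double]], Array[Double])] =
  layerSizes.sliding(2).map { case Array(in, out) =>
    (Array.fill(out, in)(rng.nextGaussian()), Array.fill(out)(rng.nextGaussian()))
  }.toArray

def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def softmax(zs: Array[Double]): Array[Double] = {
  val m = zs.max // subtract the maximum for numerical stability
  val exps = zs.map(z => math.exp(z - m))
  val total = exps.sum
  exps.map(_ / total)
}

// Affine transform of one layer: weights * input + bias.
def affine(w: Array[Array[Double]], b: Array[Double], x: Array[Double]): Array[Double] =
  w.zip(b).map { case (row, bias) => row.zip(x).map { case (wi, xi) => wi * xi }.sum + bias }

def forward(input: Array[Double]): Array[Double] = {
  // Hidden layers use the sigmoid activation.
  val hidden = parameters.init.foldLeft(input) { case (x, (w, b)) => affine(w, b, x).map(sigmoid) }
  // The output layer uses softmax, giving one probability per class.
  val (wOut, bOut) = parameters.last
  softmax(affine(wOut, bOut, hidden))
}

val probabilities = forward(Array(5.1, 3.5, 1.4, 0.2))
```

The result has one entry per class and sums to 1. With random weights the values carry no meaning until the network is trained.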

### Configuring the trainer

In [46]:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setLabelCol("LabelIndexado")
  .setFeaturesCol("FeaturesIndexadas")
  .setBlockSize(128)
  .setSeed(System.currentTimeMillis) // time-based seed: results vary between runs
  .setMaxIter(200)

trainer: MultilayerPerceptronClassifier = mlpc_c15e9f68b8ac

In [47]:
import org.apache.spark.ml.feature.IndexToString

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("LabelPredicho")
  .setLabels(labelIndexer.labels)

labelConverter: IndexToString = idxToStr_2605157052f6

In [48]:
import org.apache.spark.ml.Pipeline

// Chain indexers and MultilayerPerceptronClassifier in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, trainer, labelConverter))

pipeline: Pipeline = pipeline_5a960d9989df

In [None]:
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

19/10/24 18:21:30 INFO StrongWolfeLineSearch: Line search t: 1.5326785113182582 fval: 1.0935071820474427 rhs: 1.1334171003108582 cdd: -0.001426756243714009
19/10/24 18:21:30 INFO LBFGS: Step Size: 1,533
19/10/24 18:21:30 INFO LBFGS: Val and Grad Norm: 1,09351 (rel: 0,0352) 0,0383731


19/10/24 18:21:31 INFO LBFGS: Step Size: 3,375
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 1,08582 (rel: 0,00703) 0,0473219


19/10/24 18:21:31 INFO StrongWolfeLineSearch: Line search t: 0.27377832365441535 fval: 1.0804894567167258 rhs: 1.0858199930450607 cdd: 0.034238023758438936
19/10/24 18:21:31 INFO LBFGS: Step Size: 0,2738
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 1,08049 (rel: 0,00491) 0,0739352


19/10/24 18:21:31 INFO LBFGS: Step Size: 2,250
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 1,04890 (rel: 0,0292) 0,116739


19/10/24 18:21:31 INFO StrongWolfeLineSearch: Line search t: 0.10673684511029435 fval: 1.0417543369649103 rhs: 1.0488976409638553 cdd: -0.03193771348092057
19/10/24 18:21:31 INFO LBFGS: Step Size: 0,1067
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 1,04175 (rel: 0,00681) 0,197781


19/10/24 18:21:31 INFO LBFGS: Step Size: 3,375
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 0,798528 (rel: 0,233) 1,18339


19/10/24 18:21:31 INFO StrongWolfeLineSearch: Line search t: 0.20292542043279005 fval: 0.6966826750959184 rhs: 0.7984848774505853 cdd: 0.758047534369334
19/10/24 18:21:31 INFO LBFGS: Step Size: 0,2029
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 0,696683 (rel: 0,128) 0,531562


19/10/24 18:21:31 INFO LBFGS: Step Size: 1,000
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 0,588965 (rel: 0,155) 0,452693


19/10/24 18:21:31 INFO LBFGS: Step Size: 1,000
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 0,525634 (rel: 0,108) 0,250289


19/10/24 18:21:31 INFO LBFGS: Step Size: 1,000
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 0,510368 (rel: 0,0290) 0,189873


19/10/24 18:21:31 INFO LBFGS: Step Size: 1,000
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 0,493171 (rel: 0,0337) 0,0987314


19/10/24 18:21:31 INFO LBFGS: Step Size: 1,000
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 0,484519 (rel: 0,0175) 0,0405984


19/10/24 18:21:31 INFO LBFGS: Step Size: 1,000
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 0,480487 (rel: 0,00832) 0,0122392


19/10/24 18:21:31 INFO LBFGS: Step Size: 1,000
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 0,478445 (rel: 0,00425) 0,0153590


19/10/24 18:21:31 INFO LBFGS: Step Size: 1,000
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 0,476528 (rel: 0,00401) 0,0232523


19/10/24 18:21:31 INFO LBFGS: Step Size: 1,000
19/10/24 18:21:31 INFO LBFGS: Val and Grad Norm: 0,473128 (rel: 0,00713) 0,0357040


In [None]:
// Make predictions.
val predictions = model.transform(testData)

In [None]:
// Select example rows to display.
predictions.show(5)

In [None]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("LabelIndexado")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test Error = " + (1.0 - accuracy))
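For reference, the accuracy metric is simply the fraction of rows whose predicted label equals the true one, and the test error is its complement. The sketch below computes both in plain Scala on made-up indexed labels, and adds a small table of confusion counts. The values and the names `toyAccuracy`, `toyTestError` and `confusion` are hypothetical and are not results from this notebook.

```scala
// Plain-Scala sketch of the accuracy metric on hypothetical indexed labels.
val trueLabels      = Seq(0.0, 1.0, 2.0, 2.0, 1.0, 0.0, 1.0, 2.0)
val predictedLabels = Seq(0.0, 1.0, 2.0, 1.0, 1.0, 0.0, 2.0, 2.0)

val pairs = trueLabels.zip(predictedLabels)

// Accuracy: fraction of rows whose prediction equals the true label.
val toyAccuracy  = pairs.count { case (t, p) => t == p }.toDouble / pairs.size
val toyTestError = 1.0 - toyAccuracy

// Confusion counts keyed by (true label, predicted label).
val confusion: Map[(Double, Double), Int] =
  pairs.groupBy(identity).map { case (key, rows) => key -> rows.size }
```

The confusion counts show which classes are mistaken for each other, something a single accuracy figure hides.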

In [None]:
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel

val mlpc = model.stages(2).asInstanceOf[MultilayerPerceptronClassificationModel]

println(s"Classification model:\n$mlpc")
println(s"Parameters: ${mlpc.explainParams}")
println(s"Weights: ${mlpc.weights}")
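As a sanity check on the printed weights, the number of trainable parameters can be derived from the layer sizes alone. The sketch below assumes that each connection between consecutive layers contributes `in * out` weights plus `out` biases. The names `topology`, `parametersPerLayer` and `totalParameters` are hypothetical.

```scala
// Expected number of trainable parameters for the 4-5-5-3 topology.
val topology = Array(4, 5, 5, 3)

// Each connection between consecutive layers needs in * out weights plus out biases
// (assumption of this sketch: one bias per non-input unit).
val parametersPerLayer: Array[Int] =
  topology.sliding(2).map { case Array(in, out) => in * out + out }.toArray

val totalParameters = parametersPerLayer.sum
```

If that assumption matches how the model stores its parameters, `mlpc.weights.size` should equal `totalParameters`.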