# Logistic Regression with Spark MLlib

In this notebook we will see how to perform a simple classification using a **Logistic Regression** model with Spark's MLlib module. The model we are going to build aims to identify malignant breast tumors from information extracted from needle biopsies.

## Importing the libraries

In [1]:
import os
import shutil
import pandas as pd

In [2]:
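# Import only the names the notebook uses, so Python builtins are not shadowed.
from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, FloatType)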

In [3]:
from pyspark.sql.functions import count, max, col, year, sum

## Initializing Spark

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('basic').getOrCreate()
from pyspark.sql.types import *

## Loading the dataset

Download the .zip containing the dataset from Kaggle.

In [5]:
!kaggle datasets download uciml/breast-cancer-wisconsin-data

breast-cancer-wisconsin-data.zip: Skipping, found more recently modified local copy (use --force to force download)


Extract the dataset from the .zip.

In [6]:
shutil.unpack_archive('breast-cancer-wisconsin-data.zip') 

Now load the dataset.

In [7]:
cancer_df = spark.read.csv("data.csv", header = True)

In [8]:
cancer_df.show(2)

+------+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+----+
|    id|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|_c32|
+------+---------+-----------+------------

In [9]:
print('The dataset contains ' + str(len(cancer_df.columns)) + ' columns')

The dataset contains 33 columns


The DataFrame has 33 columns:

* **1 identifier**: the **id** column.
* **30 features**: properties of the image, such as radius, area and perimeter.
* **1 target**: the **diagnosis** column; a value of 'M' indicates a malignant tumor, while a value of 'B' indicates a benign one.
* **1 spurious column**: **_c32**, an unnamed trailing column in the CSV that is not used.

## Data preprocessing

In [10]:
cancer_df.count()

569

### Changing the schema

Let's look at the schema.

In [11]:
cancer_df.printSchema()

root
 |-- id: string (nullable = true)
 |-- diagnosis: string (nullable = true)
 |-- radius_mean: string (nullable = true)
 |-- texture_mean: string (nullable = true)
 |-- perimeter_mean: string (nullable = true)
 |-- area_mean: string (nullable = true)
 |-- smoothness_mean: string (nullable = true)
 |-- compactness_mean: string (nullable = true)
 |-- concavity_mean: string (nullable = true)
 |-- concave points_mean: string (nullable = true)
 |-- symmetry_mean: string (nullable = true)
 |-- fractal_dimension_mean: string (nullable = true)
 |-- radius_se: string (nullable = true)
 |-- texture_se: string (nullable = true)
 |-- perimeter_se: string (nullable = true)
 |-- area_se: string (nullable = true)
 |-- smoothness_se: string (nullable = true)
 |-- compactness_se: string (nullable = true)
 |-- concavity_se: string (nullable = true)
 |-- concave points_se: string (nullable = true)
 |-- symmetry_se: string (nullable = true)
 |-- fractal_dimension_se: string (nullable = true)
 |-- radiu

Every column was read as a string, so let's define an explicit schema.

In [12]:
data_schema = [StructField('id', IntegerType()),
               StructField('diagnosis', StringType()),
               StructField('radius_mean', FloatType()),
               StructField('texture_mean', FloatType()),
               StructField('perimeter_mean', FloatType()),
               StructField('area_mean', FloatType()),
               StructField('smoothness_mean', FloatType()),
               StructField('compactness_mean', FloatType()),
               StructField('concavity_mean', FloatType()),
               StructField('concave points_mean', FloatType()),
               StructField('symmetry_mean', FloatType()),
               StructField('fractal_dimension_mean', FloatType()),
               StructField('radius_se', FloatType()),
               StructField('texture_se', FloatType()),
               StructField('perimeter_se', FloatType()),
               StructField('area_se', FloatType()),
               StructField('smoothness_se', FloatType()),
               StructField('compactness_se', FloatType()),
               StructField('concavity_se', FloatType()),
               StructField('concave points_se', FloatType()),
               StructField('symmetry_se', FloatType()),
               StructField('fractal_dimension_se', FloatType()),
               StructField('radius_worst', FloatType()),
               StructField('texture_worst', FloatType()),
               StructField('perimeter_worst', FloatType()),
               StructField('area_worst', FloatType()),
               StructField('smoothness_worst', FloatType()),
               StructField('compactness_worst', FloatType()),
               StructField('concavity_worst', FloatType()),
               StructField('concave points_worst', FloatType()),
               StructField('symmetry_worst', FloatType()),
               StructField('fractal_dimension_worst', FloatType())
               ]

schema = StructType(fields = data_schema)

In [13]:
cancer_df = spark.read.csv("data.csv", header = True, schema = schema)

In [14]:
cancer_df.show(2)

+------+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+
|    id|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|
+------+---------+-----------+------------+---------

The dataset shows that the target variable, **diagnosis**, can take only two values.

In [15]:
cancer_df.groupBy('diagnosis').agg({'diagnosis':'count'}).show()

+---------+----------------+
|diagnosis|count(diagnosis)|
+---------+----------------+
|        B|             357|
|        M|             212|
+---------+----------------+



* **B**: benign tumor diagnosis
* **M**: malignant tumor diagnosis

Let's convert this category into a binary variable.

* **B** --> **0**
* **M** --> **1**

In [16]:
cancer_df = cancer_df.withColumn('diagnosis', regexp_replace('diagnosis', 'M', '1'))
cancer_df = cancer_df.withColumn('diagnosis', regexp_replace('diagnosis', 'B', '0'))

In [17]:
from pyspark.sql.types import IntegerType

cancer_df = cancer_df.withColumn("diagnosis", cancer_df["diagnosis"].cast(IntegerType()))

In [18]:
cancer_df.show(2)

+------+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+
|    id|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|
+------+---------+-----------+------------+---------

In [19]:
cancer_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- diagnosis: integer (nullable = true)
 |-- radius_mean: float (nullable = true)
 |-- texture_mean: float (nullable = true)
 |-- perimeter_mean: float (nullable = true)
 |-- area_mean: float (nullable = true)
 |-- smoothness_mean: float (nullable = true)
 |-- compactness_mean: float (nullable = true)
 |-- concavity_mean: float (nullable = true)
 |-- concave points_mean: float (nullable = true)
 |-- symmetry_mean: float (nullable = true)
 |-- fractal_dimension_mean: float (nullable = true)
 |-- radius_se: float (nullable = true)
 |-- texture_se: float (nullable = true)
 |-- perimeter_se: float (nullable = true)
 |-- area_se: float (nullable = true)
 |-- smoothness_se: float (nullable = true)
 |-- compactness_se: float (nullable = true)
 |-- concavity_se: float (nullable = true)
 |-- concave points_se: float (nullable = true)
 |-- symmetry_se: float (nullable = true)
 |-- fractal_dimension_se: float (nullable = true)
 |-- radius_worst: float (nu

### Null values

First of all, let's check for null values.

In [20]:
from functools import reduce

# Build one condition that is true when any column of a row is null,
# then count the rows that satisfy it.
any_null = reduce(lambda a, b: a | b,
                  [cancer_df[c].isNull() for c in cancer_df.columns])
cancer_df.filter(any_null).count()

0

There are no null values in our dataset.

### VectorAssembler

MLlib requires all the features to be packed into a single vector stored in one column. We can create this representation with MLlib's **VectorAssembler** class.
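Conceptually, assembling just collects the selected column values of each row, in the given order, into one vector. A minimal plain-Python sketch of that idea follows; the `assemble` helper and the sample values are hypothetical and only illustrate the concept, they are not how MLlib implements it.

```python
def assemble(row, input_cols):
    """Return the values of input_cols from row, in order, as one list."""
    return [float(row[name]) for name in input_cols]

# Hypothetical row with made-up values, keyed by column name.
row = {"id": 1, "diagnosis": 1, "radius_mean": 14.2, "texture_mean": 20.1}

# Only the listed columns end up in the feature vector; id and diagnosis are left out.
features = assemble(row, ["radius_mean", "texture_mean"])
print(features)  # [14.2, 20.1]
```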

In [21]:
from pyspark.ml.feature import VectorAssembler

Build a list with the names of the columns that will be the features of our model, that is, all the columns except the target (diagnosis) and the identifier (id).

In [22]:
columns = cancer_df.columns
columns.remove('diagnosis')
columns.remove('id')

In [23]:
columns

['radius_mean',
 'texture_mean',
 'perimeter_mean',
 'area_mean',
 'smoothness_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'symmetry_mean',
 'fractal_dimension_mean',
 'radius_se',
 'texture_se',
 'perimeter_se',
 'area_se',
 'smoothness_se',
 'compactness_se',
 'concavity_se',
 'concave points_se',
 'symmetry_se',
 'fractal_dimension_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'compactness_worst',
 'concavity_worst',
 'concave points_worst',
 'symmetry_worst',
 'fractal_dimension_worst']

In [24]:
assembler = VectorAssembler(inputCols=columns,outputCol='features')

In [25]:
cancer_df = assembler.transform(cancer_df)

In [26]:
cancer_df.show(2)

+------+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+--------------------+
|    id|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|            features|
+------+--

### Standardization

It is good practice to bring the features onto a common scale. We do so through standardization, which rescales each column to have mean 0 and standard deviation 1 (it does not make the data normally distributed). We can standardize with MLlib's StandardScaler class; note that with its default parameters (`withMean=False`, `withStd=True`) it only divides each feature by its standard deviation, and `withMean=True` is needed to also center the features on 0.
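To make the difference between scaling only and full standardization concrete, here is a small plain-Python sketch on made-up numbers. The helper names are hypothetical, and the use of the sample standard deviation reflects my understanding of what MLlib's StandardScaler computes; treat it as an assumption.

```python
import statistics

def scale_only(values):
    """Divide by the sample standard deviation, without centering."""
    std = statistics.stdev(values)
    return [v / std for v in values]

def standardize(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

radius = [10.0, 12.0, 14.0, 16.0, 18.0]  # made-up values
scaled = scale_only(radius)
standardized = standardize(radius)

# Both results have unit standard deviation; only the second has mean 0.
for name, column in [("scaled", scaled), ("standardized", standardized)]:
    print(f"{name}: mean={statistics.fmean(column):.3f}, "
          f"std={statistics.stdev(column):.3f}")
```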

In [27]:
from pyspark.ml.feature import StandardScaler

In [28]:
ss = StandardScaler(inputCol='features', outputCol='scaled_features')

In [29]:
ss = ss.fit(cancer_df)

In [30]:
cancer_df = ss.transform(cancer_df)

In [31]:
cancer_df.show(5)

+--------+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+--------------------+--------------------+
|      id|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|       

### Train set e Test set

Now we can create the training and test DataFrames with the **randomSplit** method. We assign 70% of the examples to the training set and 30% to the test set. Because the split is random and no seed is set, the exact counts vary from run to run.

In [32]:
train_df, test_df = cancer_df.randomSplit([0.7, 0.3])

In [33]:
print("%d examples in the train set" % train_df.count())
print("%d examples in the test set" % test_df.count())

394 examples in the train set
175 examples in the test set
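The counts above are close to, but not exactly, 70% and 30% of 569. A plausible way to picture why is that each row is assigned to one side by an independent random draw; the plain-Python sketch below illustrates that idea with a hypothetical `random_split` helper (it is not Spark's implementation). In Spark itself, `randomSplit` accepts a `seed` argument to make the split reproducible.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(569))  # stand-in for the 569 examples
train, test = random_split(rows, 0.7, seed=42)

# The sizes are near 70/30 but usually not exact; the same seed gives the same split.
print(len(train), len(test))
```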


## Logistic Regression

Now we can create the Logistic Regression model using the **LogisticRegression** class. We need to pass two parameters to its constructor:

* **featuresCol**: the name of the column with the features
* **labelCol**: the name of the column with the target

In [34]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="scaled_features", labelCol="diagnosis")

Let's start the training with the fit method, passing it the training set.

In [35]:
lr = lr.fit(train_df)
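As a reminder of what the fitted model computes, a binary logistic regression turns a weighted sum of the features into a probability through the logistic (sigmoid) function. The sketch below shows this in plain Python; the `predict_proba` helper and the weights are hypothetical, not the coefficients learned above.

```python
import math

def predict_proba(features, weights, intercept):
    """Probability of class 1 under a logistic regression model."""
    z = intercept + math.fsum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for a two-feature model (not the fitted ones).
weights, intercept = [1.5, -0.5], 0.25

p_zero = predict_proba([0.0, 0.0], [0.0, 0.0], 0.0)  # a zero score maps to 0.5
p = predict_proba([2.0, 1.0], weights, intercept)     # a positive score maps above 0.5
print(p_zero, p)
```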

## Evaluating the model

Now let's check the quality of the model by evaluating it on the test set.

In [36]:
evaluation = lr.evaluate(test_df)

The **evaluate** method computes several metrics that help us understand the quality of the model. Let's look at some of them.

### Accuracy

The **accuracy** is simply the fraction of classifications that our model got right.

In [37]:
print('Accuracy: ' + str(evaluation.accuracy))

Accuracy: 0.9714285714285714


### Precision

The **precision** tells us, among the examples assigned to a given class, how many actually belong to that class.

In [38]:
print('Precision: ' + str(evaluation.precisionByLabel))

Precision: [0.9736842105263158, 0.9672131147540983]


### Recall

The **recall** tells us how many of the positive cases the model classified correctly; `recallByLabel` reports it for each class in turn.

In [39]:
print('Recall: ' + str(evaluation.recallByLabel))

Recall: [0.9823008849557522, 0.9516129032258065]
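The three metrics can be tied together with a small worked example. The confusion counts below are not printed by the notebook: they are inferred from the values above, as the only counts over 175 test examples that reproduce them, with class 1 (malignant) taken as the positive class.

```python
# Confusion counts inferred from the printed metrics (an inference, not a
# notebook output): true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = 111, 2, 3, 59

# Fraction of all test examples that were classified correctly.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Per label, in the order [class 0 (benign), class 1 (malignant)].
precision = [tn / (tn + fn), tp / (tp + fp)]  # correct among those predicted as the class
recall = [tn / (tn + fp), tp / (tp + fn)]     # correct among those truly in the class

print(accuracy)
print(precision)
print(recall)
```

Read this way, the model missed 3 of the 62 malignant tumors in the test set and raised 2 false alarms among the 113 benign ones.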


## Testing the model

Now that we have trained and validated our model, let's test it on new data. A clinic sends us a CSV file containing the needle biopsy results of two patients in their care; we must use our model to identify any malignant tumors. Let's download the CSV.

In [40]:
!wget --no-check-certificate https://raw.githubusercontent.com/ProfAI/bigdata/master/7%20-%20Machine%20Learning%20Supervisionato%20-%20Classificazione/data/exam_results.csv

--2021-01-27 22:48:31--  https://raw.githubusercontent.com/ProfAI/bigdata/master/7%20-%20Machine%20Learning%20Supervisionato%20-%20Classificazione/data/exam_results.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.80.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.80.133|:443... connected.
  Self-signed certificate encountered.
HTTP request sent, awaiting response... 200 OK
Length: 1178 (1.2K) [text/plain]
Saving to: 'exam_results.csv.1'

     0K .                                                     100% 4.21M=0s

2021-01-27 22:48:31 (4.21 MB/s) - 'exam_results.csv.1' saved [1178/1178]



### Loading the dataset

In [41]:
dataset = spark.read.csv('exam_results.csv', header=True)

In [42]:
dataset.show()

+------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+
|    id|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|
+------+-----------+------------+--------------+---------+--------------

### Preprocessing the dataset

#### Changing the schema

Let's look at the schema.

In [43]:
dataset.printSchema()

root
 |-- id: string (nullable = true)
 |-- radius_mean: string (nullable = true)
 |-- texture_mean: string (nullable = true)
 |-- perimeter_mean: string (nullable = true)
 |-- area_mean: string (nullable = true)
 |-- smoothness_mean: string (nullable = true)
 |-- compactness_mean: string (nullable = true)
 |-- concavity_mean: string (nullable = true)
 |-- concave points_mean: string (nullable = true)
 |-- symmetry_mean: string (nullable = true)
 |-- fractal_dimension_mean: string (nullable = true)
 |-- radius_se: string (nullable = true)
 |-- texture_se: string (nullable = true)
 |-- perimeter_se: string (nullable = true)
 |-- area_se: string (nullable = true)
 |-- smoothness_se: string (nullable = true)
 |-- compactness_se: string (nullable = true)
 |-- concavity_se: string (nullable = true)
 |-- concave points_se: string (nullable = true)
 |-- symmetry_se: string (nullable = true)
 |-- fractal_dimension_se: string (nullable = true)
 |-- radius_worst: string (nullable = true)
 |-- te

Change the schema.

In [44]:
data_schema = [StructField('id', IntegerType()),
               StructField('radius_mean', FloatType()),
               StructField('texture_mean', FloatType()),
               StructField('perimeter_mean', FloatType()),
               StructField('area_mean', FloatType()),
               StructField('smoothness_mean', FloatType()),
               StructField('compactness_mean', FloatType()),
               StructField('concavity_mean', FloatType()),
               StructField('concave points_mean', FloatType()),
               StructField('symmetry_mean', FloatType()),
               StructField('fractal_dimension_mean', FloatType()),
               StructField('radius_se', FloatType()),
               StructField('texture_se', FloatType()),
               StructField('perimeter_se', FloatType()),
               StructField('area_se', FloatType()),
               StructField('smoothness_se', FloatType()),
               StructField('compactness_se', FloatType()),
               StructField('concavity_se', FloatType()),
               StructField('concave points_se', FloatType()),
               StructField('symmetry_se', FloatType()),
               StructField('fractal_dimension_se', FloatType()),
               StructField('radius_worst', FloatType()),
               StructField('texture_worst', FloatType()),
               StructField('perimeter_worst', FloatType()),
               StructField('area_worst', FloatType()),
               StructField('smoothness_worst', FloatType()),
               StructField('compactness_worst', FloatType()),
               StructField('concavity_worst', FloatType()),
               StructField('concave points_worst', FloatType()),
               StructField('symmetry_worst', FloatType()),
               StructField('fractal_dimension_worst', FloatType())
               ]

schema = StructType(fields = data_schema)

In [45]:
dataset = spark.read.csv('exam_results.csv', header=True, schema = schema)

In [46]:
dataset.show(2)

+------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+
|    id|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|
+------+-----------+------------+--------------+---------+--------------

In [47]:
dataset.printSchema()

root
 |-- id: integer (nullable = true)
 |-- radius_mean: float (nullable = true)
 |-- texture_mean: float (nullable = true)
 |-- perimeter_mean: float (nullable = true)
 |-- area_mean: float (nullable = true)
 |-- smoothness_mean: float (nullable = true)
 |-- compactness_mean: float (nullable = true)
 |-- concavity_mean: float (nullable = true)
 |-- concave points_mean: float (nullable = true)
 |-- symmetry_mean: float (nullable = true)
 |-- fractal_dimension_mean: float (nullable = true)
 |-- radius_se: float (nullable = true)
 |-- texture_se: float (nullable = true)
 |-- perimeter_se: float (nullable = true)
 |-- area_se: float (nullable = true)
 |-- smoothness_se: float (nullable = true)
 |-- compactness_se: float (nullable = true)
 |-- concavity_se: float (nullable = true)
 |-- concave points_se: float (nullable = true)
 |-- symmetry_se: float (nullable = true)
 |-- fractal_dimension_se: float (nullable = true)
 |-- radius_worst: float (nullable = true)
 |-- texture_worst: float (

#### Checking for null values

Check for null values.

In [48]:
from functools import reduce

# Check the new data (dataset), not the training DataFrame (cancer_df):
# build one condition that is true when any column of a row is null,
# then count the rows that satisfy it.
any_null = reduce(lambda a, b: a | b,
                  [dataset[c].isNull() for c in dataset.columns])
dataset.filter(any_null).count()

0

There are no null values.

#### VectorAssembler

Let's create the column with the features.

In [49]:
dataset = assembler.transform(dataset)

In [50]:
dataset.show()

+------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+--------------------+
|    id|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|            features|
+------+-----------+----------

#### Standardization

Let's apply the standardization, reusing the scaler fitted on the training data.

In [51]:
dataset = ss.transform(dataset)

In [52]:
dataset.show()

+------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+--------------------+--------------------+
|    id|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|            features|     scale

#### Applying the model

We obtain the predictions using the model's **transform** method.

In [53]:
dataset = lr.transform(dataset)

In [54]:
dataset.columns    # note the last three columns

['id',
 'radius_mean',
 'texture_mean',
 'perimeter_mean',
 'area_mean',
 'smoothness_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'symmetry_mean',
 'fractal_dimension_mean',
 'radius_se',
 'texture_se',
 'perimeter_se',
 'area_se',
 'smoothness_se',
 'compactness_se',
 'concavity_se',
 'concave points_se',
 'symmetry_se',
 'fractal_dimension_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'compactness_worst',
 'concavity_worst',
 'concave points_worst',
 'symmetry_worst',
 'fractal_dimension_worst',
 'features',
 'scaled_features',
 'rawPrediction',
 'probability',
 'prediction']

In [55]:
dataset.show()

+------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+--------------------+--------------------+--------------------+--------------------+----------+
|    id|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fr

The resulting DataFrame contains three new columns (`rawPrediction`, `probability` and `prediction`). The two we are interested in are:

* **prediction**: the predicted label (0=benign, 1=malignant)
* **probability**: the probability of belonging to each of the two classes.

In [57]:
dataset.select(['probability','prediction']).show(truncate=False)

+------------------------------------------+----------+
|probability                               |prediction|
+------------------------------------------+----------+
|[0.9999999999951821,4.817904750341217E-12]|0.0       |
|[0.9999999999414246,5.857536411252989E-11]|0.0       |
+------------------------------------------+----------+



Both tumors can be considered benign.
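The link between the two columns can be sketched in plain Python: the label is 1 when the probability of class 1 exceeds a decision threshold, and 0 otherwise. The threshold of 0.5 used here is an assumption about the model's setting, since the notebook never sets one explicitly.

```python
# Probability vectors copied from the output above: [P(class 0), P(class 1)].
probabilities = [
    [0.9999999999951821, 4.817904750341217e-12],
    [0.9999999999414246, 5.857536411252989e-11],
]

threshold = 0.5  # assumed decision threshold for the positive (malignant) class

# Predict 1.0 only when the malignant probability exceeds the threshold.
predictions = [1.0 if p[1] > threshold else 0.0 for p in probabilities]
print(predictions)  # [0.0, 0.0]
```

In a screening setting one might prefer a lower threshold, trading more false alarms for fewer missed malignant cases.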