# Graded practical exercise: Machine Learning with PySpark

In this notebook we train a binary classification model that predicts lung cancer from several features. We use the dataset **lung_cancer.csv** (provided via GitHub).

In [0]:
import numpy as np 
import pandas as pd 



In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Lung Cancer').getOrCreate()

In [0]:
from pyspark.ml.classification import LogisticRegression

In [0]:


# Load the CSV from DBFS; the first row is the header and column types are inferred.
lung = spark.read.csv('/FileStore/tables/lung_cancer.csv',
                      inferSchema=True,
                      header=True)

In [0]:
lung.show(5)

+------+-----------+---+------+-----+------+------+
|  Name|    Surname|Age|Smokes|AreaQ|Alkhol|Result|
+------+-----------+---+------+-----+------+------+
|  John|       Wick| 35|     3|    5|     4|     1|
|  John|Constantine| 27|    20|    2|     5|     1|
|Camela|   Anderson| 30|     0|    5|     2|     0|
|  Alex|     Telles| 28|     0|    8|     1|     0|
| Diego|   Maradona| 68|     4|    5|     6|     1|
+------+-----------+---+------+-----+------+------+
only showing top 5 rows



In [0]:
lung.toPandas().head()

     Name      Surname  Age  Smokes  AreaQ  Alkhol  Result
0    John         Wick   35       3      5       4       1
1    John  Constantine   27      20      2       5       1
2  Camela     Anderson   30       0      5       2       0
3    Alex       Telles   28       0      8       1       0
4   Diego     Maradona   68       4      5       6       1


In [0]:
lung.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Surname: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Smokes: integer (nullable = true)
 |-- AreaQ: integer (nullable = true)
 |-- Alkhol: integer (nullable = true)
 |-- Result: integer (nullable = true)



### Data preprocessing

In [0]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [0]:
lung.columns

Out[10]: ['Name', 'Surname', 'Age', 'Smokes', 'AreaQ', 'Alkhol', 'Result']

In [0]:
# Combine the four numeric columns into a single vector column named "features".
assembler = VectorAssembler(
    inputCols=['Age', 'Smokes', 'AreaQ', 'Alkhol'],
    outputCol="features")

In [0]:
output = assembler.transform(lung)

In [0]:
output.show(5)

+------+-----------+---+------+-----+------+------+-------------------+
|  Name|    Surname|Age|Smokes|AreaQ|Alkhol|Result|           features|
+------+-----------+---+------+-----+------+------+-------------------+
|  John|       Wick| 35|     3|    5|     4|     1| [35.0,3.0,5.0,4.0]|
|  John|Constantine| 27|    20|    2|     5|     1|[27.0,20.0,2.0,5.0]|
|Camela|   Anderson| 30|     0|    5|     2|     0| [30.0,0.0,5.0,2.0]|
|  Alex|     Telles| 28|     0|    8|     1|     0| [28.0,0.0,8.0,1.0]|
| Diego|   Maradona| 68|     4|    5|     6|     1| [68.0,4.0,5.0,6.0]|
+------+-----------+---+------+-----+------+------+-------------------+
only showing top 5 rows



In [0]:
final_data = output.select("features",'Result')

### Model training

In [0]:
train, test = final_data.randomSplit([0.7,0.3])
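
Note on the split: `randomSplit` is called without a seed here, so each run can produce different train and test sets, and the 70/30 proportions hold only approximately. PySpark's `DataFrame.randomSplit` accepts an optional `seed` argument for reproducibility. The plain-Python sketch below illustrates the idea of a per-row random assignment; it is not Spark's implementation, and the helper name `random_split` and the row ids are hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total  # weights are normalized to sum to 1
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform in [0, 1)
        for i, bound in enumerate(bounds):
            if u < bound:
                splits[i].append(row)
                break
    return splits


rows = list(range(100))  # hypothetical row ids, not the real dataset
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))  # sizes are close to 70/30, not exact
```

Because every row is assigned independently, the split sizes fluctuate around the requested proportions, which matters on a small dataset like this one.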

In [0]:
lr = LogisticRegression(labelCol="Result",
                        featuresCol="features")

In [0]:
model = lr.fit(train)

In [0]:
# Score both splits, then compare labels with predictions on the test split.
predict_train = model.transform(train)
predict_test = model.transform(test)
predict_test.select("Result", "prediction").show(10)

+------+----------+
|Result|prediction|
+------+----------+
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       0.0|
|     0|       0.0|
+------+----------+
only showing top 10 rows
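
Of the ten rows displayed, nine predictions match the label; the ninth row is a positive case predicted as 0.0. A minimal sketch of how to tally these outcomes follows. It uses only the ten displayed rows, so the counts describe this sample and not the full test set. The helper name `confusion_counts` is hypothetical.

```python
# (Result, prediction) pairs copied from the ten rows displayed above.
shown = [(0, 0.0), (1, 1.0), (1, 1.0), (0, 0.0), (0, 0.0),
         (0, 0.0), (0, 0.0), (1, 1.0), (1, 0.0), (0, 0.0)]


def confusion_counts(pairs):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for label, pred in pairs:
        pred = int(pred)  # Spark returns predictions as doubles
        if label == 1 and pred == 1:
            counts["tp"] += 1
        elif label == 0 and pred == 0:
            counts["tn"] += 1
        elif label == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


counts = confusion_counts(shown)
accuracy = (counts["tp"] + counts["tn"]) / len(shown)
print(counts, accuracy)
```

For a medical screening task, the false negatives are usually the costliest errors, so it helps to look at them separately rather than at accuracy alone.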



### Model evaluation

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',
                                          labelCol='Result')

predict_test.select("Result","rawPrediction","prediction","probability").show(5)


+------+--------------------+----------+--------------------+
|Result|       rawPrediction|prediction|         probability|
+------+--------------------+----------+--------------------+
|     0|[30.7272302567969...|       0.0|[0.99999999999995...|
|     1|[-6.6538881828838...|       1.0|[0.00128734109639...|
|     1|[-21.698075089680...|       1.0|[3.77264308693437...|
|     0|[35.7393679666789...|       0.0|[0.99999999999999...|
|     0|[20.0982950368171...|       0.0|[0.99999999813180...|
+------+--------------------+----------+--------------------+
only showing top 5 rows
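
The `rawPrediction` and `probability` columns are related through the logistic (sigmoid) function. For a binary logistic regression, the first entry of `probability` should equal the sigmoid of the first entry of `rawPrediction`. The sketch below checks this against the displayed values, which `show()` truncates, so the comparison is approximate. Note that the third row's probability appears as `3.7726...` only because the exponent is cut off; the sigmoid of its raw score is on the order of 1e-10.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# First rawPrediction entries as displayed above (truncated by show()).
raw_row2 = -6.6538881828838
raw_row3 = -21.698075089680

p_row2 = sigmoid(raw_row2)  # compare with the displayed 0.00128734109639...
p_row3 = sigmoid(raw_row3)  # a very small probability for class 0

print(p_row2)
print(p_row3)
```

Raw scores as large as 30 in magnitude mean the model is almost certain about those rows, which is consistent with the near-perfect ROC values reported for this small dataset.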



In [0]:
print("The area under ROC for train set is {}".format(evaluator.evaluate(predict_train)))

print("The area under ROC for test set is {}".format(evaluator.evaluate(predict_test)))

The area under ROC for train set is 1.0
The area under ROC for test set is 0.9876543209876543
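
`BinaryClassificationEvaluator` reports the area under the ROC curve by default. One way to read this number: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes that quantity directly by comparing all positive/negative pairs. It illustrates the interpretation only; Spark derives the value from the ROC curve instead. The function name `auc_by_pairs` and the sample labels and scores are hypothetical, not taken from the notebook.

```python
def auc_by_pairs(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
auc = auc_by_pairs(labels, scores)
print(auc)  # 8 of the 9 pairs are ordered correctly
```

Under this reading, a train value of 1.0 means every positive training row was scored above every negative one. On a dataset this small, such a result is worth confirming with a fixed seed or cross-validation before drawing conclusions.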


In [0]:
dbutils.fs.cp("/FileStore/tables/heart.csv", "file:/tmp/heart.csv")


Out[24]: True