# Practical Exercise: Streaming Prediction with Spark ML and Spark Streaming

Exercise 1. In this exercise we import and preprocess the data in 'data/heart.csv', which we will use to train a binary classification model with PySpark. To do so, you will need to initialize a Spark session, load the data with the correct schema, and analyze its distribution. In other words, you must complete the data import and exploratory analysis part.

In [4]:
import findspark
findspark.init()

In [5]:
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.types import *

In [6]:
from pyspark.sql import SparkSession
## Start a Spark session
spark = SparkSession.builder.appName('UCI Heart disease').getOrCreate()

In [7]:
## Load and display the CSV at Ejercicios\data\heart.csv under the name heart
heart = spark.read.csv('C:/Users/ilse-/heart.csv', 
                       inferSchema = True, 
                       header = True)
heart.show(3)

+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+
|age|sex| cp|trestbps|chol|fbs|restecg|thalach|exang|oldpeak|slope| ca|thal|target|
+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+
| 63|  1|  3|     145| 233|  1|      0|    150|    0|    2.3|    0|  0|   1|     1|
| 37|  1|  2|     130| 250|  0|      1|    187|    0|    3.5|    0|  0|   2|     1|
| 41|  0|  1|     130| 204|  0|      0|    172|    0|    1.4|    2|  0|   2|     1|
+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+
only showing top 3 rows



In [8]:
# Explicit schema for heart.csv. It is defined here but not passed to the
# spark.read.csv call above, which relies on inferSchema instead.
schema = StructType([
    StructField("age", LongType(), True),
    StructField("sex", LongType(), True),
    StructField("cp", LongType(), True),
    StructField("trestbps", LongType(), True),
    StructField("chol", LongType(), True),
    StructField("fbs", LongType(), True),
    StructField("restecg", LongType(), True),
    StructField("thalach", LongType(), True),
    StructField("exang", LongType(), True),
    StructField("oldpeak", DoubleType(), True),
    StructField("slope", LongType(), True),
    StructField("ca", LongType(), True),
    StructField("thal", LongType(), True),
    StructField("target", LongType(), True),
])

In [9]:
heart.toPandas().head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


In [10]:
from pyspark.ml import Pipeline
from pyspark.sql.types import StructType, StructField, LongType, StringType, DoubleType, TimestampType


df = heart.withColumnRenamed("target","label")
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- sex: integer (nullable = true)
 |-- cp: integer (nullable = true)
 |-- trestbps: integer (nullable = true)
 |-- chol: integer (nullable = true)
 |-- fbs: integer (nullable = true)
 |-- restecg: integer (nullable = true)
 |-- thalach: integer (nullable = true)
 |-- exang: integer (nullable = true)
 |-- oldpeak: double (nullable = true)
 |-- slope: integer (nullable = true)
 |-- ca: integer (nullable = true)
 |-- thal: integer (nullable = true)
 |-- label: integer (nullable = true)



In [11]:
testDF, trainDF = df.randomSplit([0.3, 0.7])

Exercise 2. In this exercise you will apply minimal preprocessing to the dataset from the previous exercise so that it can be used to train a model.

In [12]:
from pyspark.ml.feature import VectorAssembler

In [13]:
heart.columns


['age',
 'sex',
 'cp',
 'trestbps',
 'chol',
 'fbs',
 'restecg',
 'thalach',
 'exang',
 'oldpeak',
 'slope',
 'ca',
 'thal',
 'target']

In [14]:
assembler = VectorAssembler(
                            inputCols=['age',
                            'sex',
                            'cp',
                            'trestbps',
                            'chol',
                            'fbs',
                            'restecg',
                            'thalach',
                            'exang',
                            'oldpeak',
                            'slope',
                            'ca',
                            'thal'],
                            outputCol="features")

In [15]:
output = assembler.transform(heart)

In [16]:
output.show(5)

+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+--------------------+
|age|sex| cp|trestbps|chol|fbs|restecg|thalach|exang|oldpeak|slope| ca|thal|target|            features|
+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+--------------------+
| 63|  1|  3|     145| 233|  1|      0|    150|    0|    2.3|    0|  0|   1|     1|[63.0,1.0,3.0,145...|
| 37|  1|  2|     130| 250|  0|      1|    187|    0|    3.5|    0|  0|   2|     1|[37.0,1.0,2.0,130...|
| 41|  0|  1|     130| 204|  0|      0|    172|    0|    1.4|    2|  0|   2|     1|[41.0,0.0,1.0,130...|
| 56|  1|  1|     120| 236|  0|      1|    178|    0|    0.8|    2|  0|   2|     1|[56.0,1.0,1.0,120...|
| 57|  0|  0|     120| 354|  0|      1|    163|    1|    0.6|    2|  0|   2|     1|[57.0,0.0,0.0,120...|
+---+---+---+--------+----+---+-------+-------+-----+-------+-----+---+----+------+--------------------+
only showing top 5 rows



In [17]:
final_data = output.select("features",'target')
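
Conceptually, `VectorAssembler` concatenates the listed input columns, in order, into one numeric vector per row. The plain-Python sketch below mimics that for the first row of the dataset as displayed earlier. It is an illustration only, not Spark's implementation, and the helper name `assemble` is hypothetical.

```python
# Feature columns in the same order as the VectorAssembler inputCols.
FEATURE_COLS = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']


def assemble(row, input_cols):
    """Hypothetical helper: build a dense feature list from a dict-like row."""
    return [float(row[col]) for col in input_cols]


# First row of heart.csv, copied from the heart.show(3) output.
first_row = {'age': 63, 'sex': 1, 'cp': 3, 'trestbps': 145, 'chol': 233,
             'fbs': 1, 'restecg': 0, 'thalach': 150, 'exang': 0,
             'oldpeak': 2.3, 'slope': 0, 'ca': 0, 'thal': 1, 'target': 1}

features = assemble(first_row, FEATURE_COLS)
print(features)
```

The label column `target` is left out of the inputs on purpose, so it cannot leak into the feature vector.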

Exercise 3. In this exercise you will train a binary classification model with PySpark's machine learning library, using the preprocessed dataset from the previous exercise. Once the model is trained, you will make predictions on the test set and compare them with the true labels. Then you will compute several evaluation metrics to determine whether the model is adequate.

In [18]:
train, test = final_data.randomSplit([0.7,0.3])
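
`randomSplit` assigns rows to the splits at random according to the normalized weights, so the 70/30 proportions hold only approximately, and the split changes between runs unless a seed is passed (`randomSplit` accepts an optional `seed` argument). The pure-Python sketch below illustrates the idea on a toy list. `random_split` is a hypothetical helper and is not how Spark implements the operation.

```python
import random


def random_split(rows, weights, seed=None):
    """Hypothetical helper: assign each row to one split by a uniform draw."""
    total = float(sum(weights))
    # Cumulative upper bounds of each split's share, e.g. [0.7, 1.0].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point leftover.
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # toy stand-in for the rows of a DataFrame
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the toy split repeatable, which is also the reason to pass `seed` to `randomSplit` when results need to be compared across runs.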

In [19]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="target",
                        featuresCol="features")

In [20]:
model = lr.fit(train)

In [21]:
predict_train=model.transform(train)
predict_test=model.transform(test)
predict_test.select("target","prediction").show(10)

+------+----------+
|target|prediction|
+------+----------+
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
+------+----------+
only showing top 10 rows
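
Beyond reading the table row by row, the comparison between `target` and `prediction` can be summarized with a confusion matrix and the metrics derived from it. The sketch below does this in plain Python on a small made-up list of (label, prediction) pairs. The data and the helper name `binary_metrics` are illustrative assumptions, not results from this notebook.

```python
def binary_metrics(pairs):
    """Hypothetical helper: accuracy, precision, recall and F1 for labels in {0, 1}."""
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn,
            'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Made-up (label, prediction) pairs, for illustration only.
pairs = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 1), (0, 0), (1, 1), (0, 0)]
metrics = binary_metrics(pairs)
print(metrics)
```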



In [22]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',
                                          labelCol='target')

predict_test.select("target","rawPrediction","prediction","probability").show(5)


+------+--------------------+----------+--------------------+
|target|       rawPrediction|prediction|         probability|
+------+--------------------+----------+--------------------+
|     1|[-1.8689707409671...|       1.0|[0.13366086130424...|
|     1|[-3.6139795363243...|       1.0|[0.02623745433363...|
|     1|[-2.9783618159622...|       1.0|[0.04841304277397...|
|     1|[-1.2285248473463...|       1.0|[0.22643971622931...|
|     1|[-0.3193046060832...|       1.0|[0.42084523001965...|
+------+--------------------+----------+--------------------+
only showing top 5 rows
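
The values in this table are consistent with `probability[0]` being the logistic (sigmoid) function applied to `rawPrediction[0]`. A first entry below 0.5 means the class-1 probability is above 0.5, which matches the prediction of 1.0 at the default threshold of 0.5. The sketch below checks the relationship numerically against the truncated values displayed in the table.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# (rawPrediction[0], probability[0]) pairs copied from the table,
# truncated exactly as displayed there.
shown = [
    (-1.8689707409671, 0.13366086130424),
    (-3.6139795363243, 0.02623745433363),
    (-2.9783618159622, 0.04841304277397),
    (-1.2285248473463, 0.22643971622931),
    (-0.3193046060832, 0.42084523001965),
]

for raw0, prob0 in shown:
    print(raw0, sigmoid(raw0), prob0)

# Largest gap between the recomputed and the displayed probabilities.
max_abs_diff = max(abs(sigmoid(raw0) - prob0) for raw0, prob0 in shown)
```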



In [23]:
print("The area under ROC for train set is {}".format(evaluator.evaluate(predict_train)))

print("The area under ROC for test set is {}".format(evaluator.evaluate(predict_test)))

The area under ROC for train set is 0.9213075060532689
The area under ROC for test set is 0.901595744680851
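
The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way in plain Python on made-up scores. It illustrates the metric itself, not how `BinaryClassificationEvaluator` computes it internally, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(labels, scores):
    """Hypothetical helper: AUC as the share of correctly ordered (pos, neg) pairs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.6]
auc = pairwise_auc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for reading the train and test figures printed by the notebook.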
