### Test 2

    Yelp is a worldwide directory of services that lets its users rate services
    (restaurants, banks, clinics, gyms, among others) in order to find and recommend better ones.
    For this test we will use the data made available by Yelp to:
    Identify annoying users.
    Estimate the probability that a business closes.

In [1]:
from pyspark.sql.types import StructType, StructField, StringType, FloatType, IntegerType, DoubleType, LongType
import pyspark.sql.functions as f
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import LogisticRegression, GBTClassifier, DecisionTreeClassifier
import pandas as pd
from pyspark.sql.functions import when,col
from pyspark.sql import Row
from pyspark.sql.functions import monotonically_increasing_id

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
4,application_1576966054748_0005,pyspark,idle,,,✔


SparkSession available as 'spark'.


In [2]:
userdf = spark.read.json("s3://bigdata-desafio/yelp-data/user.json")

In [3]:
userdf.printSchema()

root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: string (nullable = true)
 |-- fans: long (nullable = true)
 |-- friends: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)

Flag in a dummy variable every user who can be classified as annoying
according to the criterion.

Recodings in the user.json file:


    friends, a string with the user_id of every other user who
    follows the user. The goal is to count the number of friends.
    
    elite, a string with every year in which the user was
    considered an elite reviewer. The goal is to count the number of years in
    which the user was considered elite.
    
    Make sure to drop the following columns: friends, yelping_since, name,
    elite, user_id.
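
The counting rule described above can be sketched in plain Python before applying it in Spark. This is only an illustration: the helper name `count_items` is hypothetical, and it assumes that a user with no friends or no elite years may arrive as `None`, an empty string, or the literal text `"None"`. The edge cases matter because splitting an empty string yields one empty fragment, so a naive count reports 1 instead of 0.

```python
def count_items(value, sep=","):
    """Count the entries of a separator-delimited string.

    Assumption: missing data may arrive as None, "" or the literal "None".
    """
    if value is None:
        return 0
    items = [item.strip() for item in value.split(sep)]
    # Drop empty fragments and the literal "None" placeholder
    return len([item for item in items if item and item != "None"])


print(count_items("2012,2013,2014"))  # three elite years
print(count_items(""))                # no entries at all
print(len("".split(",")))             # the naive count for the same input
```

If the source data does use empty strings for users without elite years, a value of 1 in a split-based count could mean either one year or none.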

In [4]:
schema = StructType([
    StructField('average_stars', DoubleType(), True),
    StructField('compliment_cool', LongType(), True),
    StructField('compliment_cute', LongType(), nullable= True),
    StructField('compliment_funny', LongType(), nullable= True),
    StructField('compliment_hot', LongType(), nullable= True),
    StructField('compliment_list', LongType(), nullable= True),
    StructField('compliment_more', LongType(), nullable= True),
    StructField('compliment_note', LongType(), nullable= True),
    StructField('compliment_photos', LongType(), nullable= True),
    StructField('compliment_plain', LongType(), nullable= True),
    StructField('compliment_profile', LongType(), nullable= True),
    StructField('compliment_writer', LongType(), nullable= True),
    StructField('cool', LongType(), nullable= True),
    StructField('elite', StringType(), nullable= True),
    StructField('fans', LongType(), nullable= True),
    StructField('friends', StringType(), nullable= True),
    StructField('funny',LongType(), nullable= True),
    StructField('name', StringType(), nullable= True),
    StructField('review_count', LongType(), nullable= True),
    StructField('useful', LongType(), nullable= True),
    StructField('user_id', StringType(), nullable= True),
    StructField('yelping_since', StringType(), nullable= True),
    StructField('molesto', IntegerType(), nullable= True),
    StructField('friend_count', LongType(), nullable= True),
    StructField('elite_count', LongType(), nullable= True)
])

In [5]:
userdf = spark.read.json("s3://bigdata-desafio/yelp-data/user.json", schema=schema)

In [6]:
userdf.printSchema()

root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: string (nullable = true)
 |-- fans: long (nullable = true)
 |-- friends: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)
 |-- molesto: integer (nullable = true)
 |-- friend_count: long (nulla

In [7]:
userdf.createOrReplaceTempView('userdf')
# Same criterion as the 'molesto' column: all three conditions must hold
query = "select * from userdf where average_stars <= 2 and review_count < 100 and fans = 0"

In [8]:
userdf = userdf\
    .withColumn('molesto', when((userdf['average_stars'] <= 2) & (userdf['review_count'] < 100) & (userdf['fans'] == 0), 1)\
    .otherwise(0))
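
The `when(...).otherwise(0)` expression above flags a user only when all three conditions hold at once. A minimal plain-Python sketch of the same rule follows; the function name `is_annoying` and the sample users are invented for illustration.

```python
def is_annoying(average_stars, review_count, fans):
    """Return 1 when all three conditions hold, else 0."""
    if average_stars <= 2 and review_count < 100 and fans == 0:
        return 1
    return 0


# Hypothetical users, invented for illustration
print(is_annoying(1.5, 10, 0))  # all three conditions hold
print(is_annoying(1.5, 10, 3))  # has fans, so not flagged
print(is_annoying(4.0, 10, 0))  # average rating too high
```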

In [9]:
userdf.select('molesto').show()

+-------+
|molesto|
+-------+
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
|      0|
+-------+
only showing top 20 rows

In [10]:
userdf = userdf\
    .withColumn('friend_count', userdf['friends'])

In [11]:
userdf = userdf.withColumn('friend_count', f.size(f.split(f.col('friends'), ',')))

In [12]:
userdf.select('friend_count').show()

+------------+
|friend_count|
+------------+
|          99|
|        1152|
|          15|
|         525|
|         231|
|        5450|
|        4326|
|        1193|
|         382|
|         898|
|         194|
|          83|
|         582|
|          25|
|         248|
|         367|
|         286|
|         258|
|        3451|
|          46|
+------------+
only showing top 20 rows

In [13]:
userdf = userdf.withColumn('elite_count', f.size(f.split(f.col('elite'), ',')))

In [14]:
userdf.select('elite_count').show()

+-----------+
|elite_count|
+-----------+
|          3|
|          1|
|          1|
|          1|
|          4|
|          4|
|          8|
|          1|
|          7|
|          1|
|          2|
|          1|
|          1|
|          1|
|          1|
|          1|
|          6|
|          1|
|          5|
|          1|
+-----------+
only showing top 20 rows

In [15]:
userdf.printSchema()

root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: string (nullable = true)
 |-- fans: long (nullable = true)
 |-- friends: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)
 |-- molesto: integer (nullable = false)
 |-- friend_count: integer (n

In [16]:
columns_to_drop = ['friends', 'elite', 'yelping_since', 'user_id', 'name']
userdf = userdf.drop(*columns_to_drop)

In [17]:
userdf.printSchema()

root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- fans: long (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- molesto: integer (nullable = false)
 |-- friend_count: integer (nullable = false)
 |-- elite_count: integer (nullable = false)

In [18]:
userdf = userdf.withColumnRenamed('molesto','label')
userdf.printSchema()

root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- fans: long (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- label: integer (nullable = false)
 |-- friend_count: integer (nullable = false)
 |-- elite_count: integer (nullable = false)

    Produce the count of annoying users based on the stated criteria (the column is now named label).

In [51]:
userdf.groupBy('label').count().sort('count').show()

+-----+-------+
|label|  count|
+-----+-------+
|    1| 183676|
|    0|1453462|
+-----+-------+

#### There are 183,676 annoying users
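
To put that count in context, the share of each class can be worked out from the two rows of the `groupBy` output. The short calculation below uses only those two numbers.

```python
annoying = 183676       # label = 1, from the groupBy output
not_annoying = 1453462  # label = 0
total = annoying + not_annoying
share = annoying / float(total)

print("total users: %d" % total)
print("annoying share: %.1f%%" % (100 * share))
```

Roughly one user in nine is flagged, so the classes are imbalanced. This is one reason to compare models on AUC rather than on plain accuracy.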

    Split the sample into a training set (70% of the data) and a validation set (30% of the data). (1 point)

In [19]:
feats = userdf.columns
feats.remove('label')

In [20]:
assemble_feats = VectorAssembler(inputCols = feats, outputCol = 'assembled_features')
assemble_feats = assemble_feats.transform(userdf)
assemble_feats = assemble_feats.select(['label', 'assembled_features'])

In [21]:
train, test = assemble_feats.randomSplit([0.7, 0.3])
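
The split above is unseeded, so every run produces a different partition and slightly different metrics. `randomSplit` accepts an optional `seed` argument for reproducibility. The plain-Python sketch below illustrates the idea under the assumption that each row is assigned by an independent random draw; the helper `random_split` and the seed value are hypothetical, and this is not Spark's actual implementation.

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test with an independent seeded draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in for DataFrame rows
train_a, test_a = random_split(rows)
train_b, test_b = random_split(rows)

print("train=%d test=%d" % (len(train_a), len(test_a)))
print(train_a == train_b)  # same seed, same partition
```

Because rows are assigned one by one, the resulting sizes are only approximately 70% and 30%.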

    Train three models (LogisticRegression, GBTClassifier and DecisionTreeClassifier)
    without changing hyperparameters that, based on the attributes available in the user.json file,
    classify annoying users. (6 points)

In [22]:
logistic_example = LogisticRegression(featuresCol='assembled_features', labelCol='label', predictionCol='molesto_pred')


In [23]:
logistic_example = logistic_example.fit(train)
logistic_example.transform(test).show(10)

+-----+--------------------+--------------------+--------------------+------------+
|label|  assembled_features|       rawPrediction|         probability|molesto_pred|
+-----+--------------------+--------------------+--------------------+------------+
|    0|(19,[0,1,2,3,12,1...|[690.858194212917...|[1.0,9.2065830610...|         0.0|
|    0|(19,[0,1,3,4,6,12...|[257.164237961607...|[1.0,2.0653350265...|         0.0|
|    0|(19,[0,1,3,4,7,12...|[566.363045385871...|[1.0,1.0756095139...|         0.0|
|    0|(19,[0,1,3,4,9,12...|[270.427238866619...|[1.0,3.5887433759...|         0.0|
|    0|(19,[0,1,3,4,9,12...|[298.121538242307...|[1.0,3.3686896704...|         0.0|
|    0|(19,[0,1,3,4,9,12...|[304.045316154618...|[1.0,9.0114970635...|         0.0|
|    0|(19,[0,1,3,4,12,1...|[402.641055916872...|[1.0,1.3652469896...|         0.0|
|    0|(19,[0,1,3,4,12,1...|[886.913592839318...|           [1.0,0.0]|         0.0|
|    0|(19,[0,1,3,4,12,1...|[3780.94029344579...|           [1.0,0.0]|      

In [25]:
GBT = GBTClassifier(featuresCol='assembled_features', labelCol='label', predictionCol='molesto_pred')
GBTf = GBT.fit(train)
GBTt = GBTf.transform(test)

In [26]:
evaluator = BinaryClassificationEvaluator()

In [27]:
DTreeC = DecisionTreeClassifier(featuresCol='assembled_features', labelCol='label', predictionCol='molesto_pred')
DTreeC = DTreeC.fit(train)
DTreeC = DTreeC.transform(test)

In [28]:
print('AUC for LogisticRegression',logistic_example.evaluate(test).areaUnderROC)
print('AUC for GBTClassifier',evaluator.evaluate(GBTt, {evaluator.metricName : "areaUnderROC"}))
print('AUC for DecisionTreeClassifier',evaluator.evaluate(DTreeC, {evaluator.metricName : "areaUnderROC"}))

('AUC for LogisticRegression', 0.9999726095434222)
('AUC for GBTClassifier', 0.999989511386619)
('AUC for DecisionTreeClassifier', 0.9993556392481716)
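
The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a tiny invented example. It is a quadratic-time illustration of the metric, not how Spark's evaluator computes it, and the name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Invented labels and scores, for illustration only
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 1.0 means every positive outranks every negative, and 0.5 is what a constant or random score would achieve.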

    Report which model is best according to the AUC metric. (1 point)

GBT gives the best AUC (0.99999), although all three models score above 0.999.

    Identify the main attributes associated with an annoying user and report them. (2
    points)

In [29]:
feat_importance = pd.DataFrame({'col':feats, 'importance': GBTf.featureImportances.toArray()})
feat_importance.sort_values(by='importance', ascending=False)

                   col    importance
0        average_stars  9.085699e-01
13                fans  5.366604e-02
15        review_count  3.767968e-02
11   compliment_writer  3.394893e-05
17        friend_count  2.928648e-05
6      compliment_more  1.958342e-05
12                cool  1.576030e-06
7      compliment_note  1.263573e-09
1      compliment_cool  4.202182e-11
5      compliment_list  9.283319e-13
9     compliment_plain  1.236168e-13
4       compliment_hot  7.819918e-14
16              useful  2.029635e-14
2      compliment_cute  2.470726e-17
8    compliment_photos  5.570409e-18
14               funny  1.180731e-18
18         elite_count  6.603813e-19
10  compliment_profile  0.000000e+00
3     compliment_funny  0.000000e+00

#### The most important attributes are therefore average_stars, fans and review_count
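
Spark normalizes tree-ensemble feature importances so that they sum to 1, which makes it easy to check how much of the total the top three attributes carry. The values below are copied from the first three rows of the table.

```python
# Importances copied from the feature importance table (top three rows)
top_three = {
    "average_stars": 9.085699e-01,
    "fans": 5.366604e-02,
    "review_count": 3.767968e-02,
}
share = sum(top_three.values())
print("top-three share of total importance: %.4f" % share)
```

Together they account for almost all of the importance, and they are the same three columns used in the `when(...)` condition that defines the label.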

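## Conclusions:
The resulting model is very powerful, and it is impressive how effective and useful this technology can be. Note that label was built from exactly average_stars, review_count and fans, so a near-perfect AUC is expected when those columns are also used as features.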

# Exercise 2: Estimating the probability that a service closes (14 points)

In [30]:
df = spark.read.json("s3://bigdata-desafio/yelp-data/business.json")

In [31]:
df.printSchema()

root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: string (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: string (nullable = true)
 |    |-- BYOB: string (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: string (nullable = true)
 |    |-- BikeParking: string (nullable = true)
 |    |-- BusinessAcceptsBitcoin: string (nullable = true)
 |    |-- BusinessAcceptsCreditCards: string (nullable = true)
 |    |-- BusinessParking: string (nullable = true)
 |    |-- ByAppointmentOnly: string (nullable = true)
 |    |-- Caters: string (nullable = true)
 |    |-- CoatCheck: string (nullable = true)
 |    |-- Corkage: string (nullable = true)
 |    |-- DietaryRestrictions: string (nullable = true)
 |    |-- DogsAllowed: string (nullable = true)
 |    |-- DriveThru: string (nullable = true)
 |    |-- GoodForDancing: str

    Implement the recoding scheme. (2 points)

In [32]:
#Objetivo = df.select('is_open')

# This step may be wrong: I am not sure whether only these attributes were requested, or whether I am misinterpreting them

In [33]:
df_proc = df.select('review_count',
          'stars',
          'is_open',
           df.attributes.getField('AcceptsInsurance'),
           df.attributes.getField('AgesAllowed'),
           df.attributes.getField('Alcohol'),
           df.attributes.getField('BusinessAcceptsBitcoin'),
           df.attributes.getField('GoodForMeal'),
           df.attributes.getField('BusinessAcceptsCreditCards'),
           df.attributes.getField('GoodForDancing'),
          df.attributes.getField('Smoking'),
          df.attributes.getField('WiFi'),
          df.attributes.getField('HasTV'),
          df.attributes.getField('DogsAllowed'),
          df.attributes.getField('GoodForKids'),
          df.attributes.getField('RestaurantsPriceRange2'),
          df.attributes.getField('NoiseLevel'),
          df.attributes.getField('HappyHour'),
         )

In [34]:
df_proc.columns

['review_count', 'stars', 'is_open', 'attributes.AcceptsInsurance', 'attributes.AgesAllowed', 'attributes.Alcohol', 'attributes.BusinessAcceptsBitcoin', 'attributes.GoodForMeal', 'attributes.BusinessAcceptsCreditCards', 'attributes.GoodForDancing', 'attributes.Smoking', 'attributes.WiFi', 'attributes.HasTV', 'attributes.DogsAllowed', 'attributes.GoodForKids', 'attributes.RestaurantsPriceRange2', 'attributes.NoiseLevel', 'attributes.HappyHour']

In [35]:
def replace_none_with_0(r):
    # Replace missing values (None) with 0; items() works on Python 2 and 3
    return Row(**{k: 0 if v is None else v for k, v in r.asDict().items()})
def replace_false_with_0(r):
    # Replace the string 'False' with 0
    return Row(**{k: 0 if v == 'False' else v for k, v in r.asDict().items()})
def replace_null_with_0(r):
    # Replace the string 'null' with 0
    return Row(**{k: 0 if v == 'null' else v for k, v in r.asDict().items()})

In [36]:
df_proc = df_proc.rdd.map(lambda x: replace_none_with_0(x)).toDF()

In [37]:
df_proc.columns

['attributes.AcceptsInsurance', 'attributes.AgesAllowed', 'attributes.Alcohol', 'attributes.BusinessAcceptsBitcoin', 'attributes.BusinessAcceptsCreditCards', 'attributes.DogsAllowed', 'attributes.GoodForDancing', 'attributes.GoodForKids', 'attributes.GoodForMeal', 'attributes.HappyHour', 'attributes.HasTV', 'attributes.NoiseLevel', 'attributes.RestaurantsPriceRange2', 'attributes.Smoking', 'attributes.WiFi', 'is_open', 'review_count', 'stars']

In [38]:
df_proc = df_proc.rdd.map(lambda x: replace_false_with_0(x)).toDF()

In [39]:
df_proc = df_proc.withColumnRenamed("attributes.AcceptsInsurance", "accepts_insurance")
df_proc = df_proc.withColumnRenamed("attributes.AgesAllowed", "all_ages_allowed")
df_proc = df_proc.withColumnRenamed("attributes.Alcohol", "alcohol_consumption")
df_proc = df_proc.withColumnRenamed("attributes.BusinessAcceptsBitcoin", "bitcoin_friendly")
df_proc = df_proc.withColumnRenamed("attributes.GoodForMeal", "food_related")
df_proc = df_proc.withColumnRenamed("attributes.BusinessAcceptsCreditCards", "finance_related")
df_proc = df_proc.withColumnRenamed("attributes.GoodForDancing", "health_related")
df_proc = df_proc.withColumnRenamed("attributes.Smoking", "smoking")
df_proc = df_proc.withColumnRenamed("attributes.WiFi", "free_wifi")
df_proc = df_proc.withColumnRenamed("is_open", "label")
df_proc = df_proc.withColumnRenamed("attributes.HasTV", "has_tv")
df_proc = df_proc.withColumnRenamed("attributes.DogsAllowed", "dog_friendly")
df_proc = df_proc.withColumnRenamed("attributes.GoodForKids", "kid_friendly")
df_proc = df_proc.withColumnRenamed("attributes.RestaurantsPriceRange2", "expensive_restaurant")
df_proc = df_proc.withColumnRenamed("attributes.NoiseLevel", "loud_place")
df_proc = df_proc.withColumnRenamed("attributes.HappyHour", "happy_hour")

In [40]:
df_proc=df_proc.na.drop()

In [41]:
df_proc.take(1)

[Row(accepts_insurance=0, all_ages_allowed=0, alcohol_consumption=0, bitcoin_friendly=0, finance_related=0, dog_friendly=0, health_related=0, kid_friendly=0, food_related=0, happy_hour=0, has_tv=0, loud_place=0, expensive_restaurant=0, smoking=0, free_wifi=0, label=0, review_count=5, stars=3.0)]
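
The helpers above turn `None` and `'False'` into 0 but leave `'True'` and any other string untouched, so a column can end up mixing integers and strings. A single recoding function that always returns 0 or 1 may be easier to reason about. The sketch below is plain Python and rests on an assumption: attribute values are either `None` or strings such as `'True'`, `'False'` and `'None'`. The name `recode_flag` and the sample record are invented for illustration.

```python
def recode_flag(value):
    """Map a raw attribute value to 1 (true) or 0 (anything else).

    Assumption: values are None or strings such as 'True', 'False', 'None'.
    """
    if value is None:
        return 0
    return 1 if str(value).strip() == "True" else 0


# Hypothetical raw record, invented for illustration
raw = {"has_tv": "True", "dog_friendly": "False", "happy_hour": None, "bitcoin_friendly": "None"}
recoded = {name: recode_flag(value) for name, value in raw.items()}
print(recoded)
```

Multi-valued attributes such as Alcohol, WiFi or NoiseLevel would need their own rule for which values count as 1; this sketch does not cover them.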

In [42]:
#Objetivo.select("is_open").distinct().show()

In [43]:
feats = df_proc.columns
feats.remove('label')

# This step may be wrong: I am not sure how to recode the target
    Recode the target vector. (2 points)

In [44]:
df_proc = df_proc\
    .withColumn('label', when(df_proc['label'] == 0, 1)\
    .otherwise(0))
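
The recoding above makes a closed business the positive class. Assuming `is_open` only takes the values 0 and 1, this is the same as computing `1 - is_open`, as the plain-Python sketch below shows. The function name `closed_label` is hypothetical.

```python
def closed_label(is_open):
    """Return 1 for a closed business (is_open == 0) and 0 otherwise."""
    return 1 if is_open == 0 else 0


for is_open in (0, 1):
    print("is_open=%d -> label=%d" % (is_open, closed_label(is_open)))
```

One caveat: `replace_none_with_0` was applied to every column earlier, so any business with a missing `is_open` would already have been set to 0 and would be counted as closed here.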

In [45]:
assemble_feats = VectorAssembler(inputCols = feats, outputCol = 'assembled_features')
assemble_feats = assemble_feats.transform(df_proc)
assemble_feats = assemble_feats.select(['label', 'assembled_features'])

    Split the sample into a training set (70% of the data) and a
    validation set (30% of the data). (1 point)

In [46]:
train, test = assemble_feats.randomSplit([0.7, 0.3])

    Train three models (LogisticRegression, GBTClassifier and DecisionTreeClassifier)
    without changing hyperparameters that, based on the recoded attributes of the
    review.json file, classify the closed services. (6 points)

In [47]:
logistic = LogisticRegression(featuresCol='assembled_features', labelCol='label', predictionCol='label_pred')
logisticf = logistic.fit(train)
logistict = logisticf.transform(test)
GBT = GBTClassifier(featuresCol='assembled_features', labelCol='label', predictionCol='label_pred')
GBTf = GBT.fit(train)
GBTt = GBTf.transform(test)
evaluator = BinaryClassificationEvaluator()
DTreeC = DecisionTreeClassifier(featuresCol='assembled_features', labelCol='label', predictionCol='label_pred')
DTreeC = DTreeC.fit(train)
DTreeC = DTreeC.transform(test)


In [48]:
print('AUC for LogisticRegression',evaluator.evaluate(logistict, {evaluator.metricName : "areaUnderROC"}))
print('AUC for GBTClassifier',evaluator.evaluate(GBTt, {evaluator.metricName : "areaUnderROC"}))
print('AUC for DecisionTreeClassifier',evaluator.evaluate(DTreeC, {evaluator.metricName : "areaUnderROC"}))

('AUC for LogisticRegression', 0.4942262611061267)
('AUC for GBTClassifier', 0.5170462767342922)
('AUC for DecisionTreeClassifier', 0.4984371834572975)

In [49]:
feat_importance = pd.DataFrame({'col':feats, 'importance': GBTf.featureImportances.toArray()})
feat_importance.sort_values(by='importance', ascending=False)

                     col  importance
15          review_count    0.569291
16                 stars    0.430709
9             happy_hour    0.000000
14             free_wifi    0.000000
13               smoking    0.000000
12  expensive_restaurant    0.000000
11            loud_place    0.000000
10                has_tv    0.000000
0      accepts_insurance    0.000000
1       all_ages_allowed    0.000000
7           kid_friendly    0.000000
6         health_related    0.000000
5           dog_friendly    0.000000
4        finance_related    0.000000
3       bitcoin_friendly    0.000000
2    alcohol_consumption    0.000000
8           food_related    0.000000

## Conclusions:
The most important attributes are the star rating and the review count. The model is suspected to be flawed (all three AUC values are close to 0.5, which is no better than random guessing), because the test itself does not specify how to recode is_open and the .py file that the statement refers to does not exist. Separately, it is worth stressing how useful Spark can be, and that it is increasingly sought after and recognized in the software industry.