### OPEN DATA PROJECT II: MODEL IMPROVEMENTS
In this second part of the assignment, we improve the model built in the first part.
Techniques used:
* Feature extraction and PCA
* Hyperparameter tuning
* Grid search

We follow the same steps as in the first part, where the linear regression algorithm was implemented.

In [5]:
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

In [6]:
df = sqlContext.read.format("com.databricks.spark.csv").options(header='true',inferschema='true').load("Admission_Predict.csv")
display(df)

DataFrame[Serial No.: int, GRE Score: int, TOEFL Score: int, University Rating: int, SOP: double, LOR : double, CGPA: double, Research: int, Chance of Admit : double]

In [7]:
df = df.withColumnRenamed("Serial No.", "Serial No")

In [8]:
df.printSchema()

root
 |-- Serial No: integer (nullable = true)
 |-- GRE Score: integer (nullable = true)
 |-- TOEFL Score: integer (nullable = true)
 |-- University Rating: integer (nullable = true)
 |-- SOP: double (nullable = true)
 |-- LOR : double (nullable = true)
 |-- CGPA: double (nullable = true)
 |-- Research: integer (nullable = true)
 |-- Chance of Admit : double (nullable = true)



In [9]:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

In [10]:
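def transData(data):
    # Pack every column except the last into a dense feature vector and use the last column as the label.
    # Note: r[:-1] also includes the "Serial No" row identifier among the features.
    return data.rdd.map(lambda r: [Vectors.dense(r[:-1]),r[-1]]).toDF(['features','label'])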

In [11]:
transformed= transData(df)
transformed.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[1.0,337.0,118.0,...| 0.92|
|[2.0,324.0,107.0,...| 0.76|
|[3.0,316.0,104.0,...| 0.72|
|[4.0,322.0,110.0,...|  0.8|
|[5.0,314.0,103.0,...| 0.65|
+--------------------+-----+
only showing top 5 rows



**Implementing the PCA algorithm**

In [12]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression 
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorIndexer

In [13]:
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

# Fit PCA on the feature vectors and keep the first two principal components
model = PCA(k=2, inputCol="features", outputCol="pca_features").fit(transformed)
data = model.transform(transformed)



In [14]:
data.show(5,True)

+--------------------+-----+--------------------+
|            features|label|        pca_features|
+--------------------+-----+--------------------+
|[1.0,337.0,118.0,...| 0.92|[3.24875991995343...|
|[2.0,324.0,107.0,...| 0.76|[2.03427489018224...|
|[3.0,316.0,104.0,...| 0.72|[0.92834004552377...|
|[4.0,322.0,110.0,...|  0.8|[0.03445640395570...|
|[5.0,314.0,103.0,...| 0.65|[-1.1026673565109...|
+--------------------+-----+--------------------+
only showing top 5 rows



In [15]:
# Split the data into training and test sets (40% held out for testing)
(trainingData, testData) = transformed.randomSplit([0.6, 0.4])

In [16]:
trainingData.show(5)
testData.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[1.0,337.0,118.0,...| 0.92|
|[5.0,314.0,103.0,...| 0.65|
|[9.0,302.0,102.0,...|  0.5|
|[12.0,327.0,111.0...| 0.84|
|[13.0,328.0,112.0...| 0.78|
+--------------------+-----+
only showing top 5 rows

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[2.0,324.0,107.0,...| 0.76|
|[3.0,316.0,104.0,...| 0.72|
|[4.0,322.0,110.0,...|  0.8|
|[6.0,330.0,115.0,...|  0.9|
|[7.0,321.0,109.0,...| 0.75|
+--------------------+-----+
only showing top 5 rows



In [17]:
# Import LinearRegression class
from pyspark.ml.regression import LinearRegression
# Define LinearRegression algorithm
lr = LinearRegression(maxIter=10, regParam=0.01, elasticNetParam=0.8)

In [18]:
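# Chain the fitted PCA model and the linear regression.
# Note: lr keeps the default featuresCol="features", so it is trained on the original feature vector, not on "pca_features".
pipeline = Pipeline(stages=[model, lr])
model = pipeline.fit(trainingData)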

In [19]:
def modelsummary(model):
    # Print the training metrics stored in the fitted LinearRegressionModel's summary
    summary = model.summary
    print("##", "---")
    print("##", "Mean squared error: % .6f" % summary.meanSquaredError,
          ", RMSE: % .6f" % summary.rootMeanSquaredError)
    print("##", "Multiple R-squared: %f" % summary.r2,
          ",     Total iterations: %i" % summary.totalIterations)

In [20]:
modelsummary(model.stages[-1])

## ---
## Mean squared error:  0.003143 , RMSE:  0.056062
## Multiple R-squared: 0.848034 ,     Total iterations: 11


In [21]:
# Make predictions.
predictions = model.transform(testData)

In [30]:
# Select example rows to display.
predictions.select("features","label","prediction").show(5)

+--------------------+-----+------------------+
|            features|label|        prediction|
+--------------------+-----+------------------+
|[1.0,337.0,118.0,...| 0.92|0.9161729400221594|
|[2.0,324.0,107.0,...| 0.76|0.7786370965160796|
|[5.0,314.0,103.0,...| 0.65|0.6251059807902062|
|[7.0,321.0,109.0,...| 0.75|0.6943647915557778|
|[9.0,302.0,102.0,...|  0.5| 0.549662077885811|
+--------------------+-----+------------------+
only showing top 5 rows



In [31]:
from pyspark.ml.evaluation import RegressionEvaluator
# Select (prediction, true label) and compute test error 
evaluator = RegressionEvaluator(labelCol="label",
                                predictionCol="prediction",
                                metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 0.061483


In [32]:
# You can also check the R² value for the test data:
y_true = predictions.select("label").toPandas()
y_pred = predictions.select("prediction").toPandas()


In [33]:
import sklearn.metrics
r2_score = sklearn.metrics.r2_score(y_true, y_pred) 
print('r2_score: {0}'.format(r2_score))

r2_score: 0.8092971528076566
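
The RMSE and R² reported above follow directly from their definitions. As a minimal plain-Python sketch (the helper names `rmse` and `r2` are illustrative, and the sample values are the five label/prediction pairs displayed earlier, rounded to four decimals):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Five (label, prediction) pairs from the predictions table, rounded
sample_labels = [0.92, 0.76, 0.65, 0.75, 0.50]
sample_predictions = [0.9162, 0.7786, 0.6251, 0.6944, 0.5497]

print("RMSE on five rows:", rmse(sample_labels, sample_predictions))
print("R2 on five rows:", r2(sample_labels, sample_predictions))
```

Five rows are far too few to reproduce the test-set figures; the point is only to show what the evaluator and `sklearn.metrics.r2_score` compute.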


##### Improving the Model

In [34]:
# Hyper-tuning

In [35]:
from pyspark.ml.regression import LinearRegression

In [36]:
import pyspark.ml.tuning as tune
import pyspark.ml.evaluation as ev
from pyspark.ml import Pipeline

# Estimator trained on the PCA output
linear = LinearRegression(labelCol='label', featuresCol='pca_features')

# 3 x 3 x 3 = 27 parameter combinations
grid = (tune.ParamGridBuilder()
        .addGrid(linear.maxIter, [2, 9, 30])
        .addGrid(linear.regParam, [0.01, 0.05, 0.3])
        .addGrid(linear.elasticNetParam, [0.09, 0.04, 0.8])
        .build())

# Note: this evaluator is designed for binary labels, whereas 'label' here is a continuous value;
# a RegressionEvaluator would be the usual choice for a regression model.
evaluator = ev.BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label')

In [37]:
cv = tune.CrossValidator(estimator=linear, estimatorParamMaps=grid, evaluator=evaluator)
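
To make the size of this search concrete, the sketch below enumerates the same parameter values in plain Python. With `CrossValidator`'s default of three folds, each combination is fitted three times, followed by one final refit of the best combination on the full training data. The names `grid_values` and `combinations` are only illustrative.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder call above
grid_values = {
    "maxIter": [2, 9, 30],
    "regParam": [0.01, 0.05, 0.3],
    "elasticNetParam": [0.09, 0.04, 0.8],
}

# Cartesian product: one dict per parameter combination
combinations = [dict(zip(grid_values, values)) for values in product(*grid_values.values())]

num_folds = 3  # CrossValidator default
total_fits = len(combinations) * num_folds + 1  # +1 for the final refit

print(len(combinations))  # 27
print(total_fits)         # 82
```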

In [38]:
pipeline = Pipeline(stages=[model])
model = pipeline.fit(trainingData)



In [39]:
cvModel = cv.fit(model.transform(trainingData))

In [40]:
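# Apply the fitted pipeline to the held-out test data, then score it with the best cross-validated model
data_test = model.transform(testData)
results = cvModel.transform(data_test)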

print(evaluator.evaluate(results, {evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(results, {evaluator.metricName: 'areaUnderPR'}))

0.869215291750503
0.9853605463577588


In [41]:
results = [
    ([{key.name: paramValue} for key, paramValue in params.items()], metric)
    for params, metric in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics)
]

# Parameter combination with the highest average cross-validation metric
sorted(results, key=lambda el: el[1], reverse=True)[0]

([{'maxIter': 9}, {'regParam': 0.01}, {'elasticNetParam': 0.09}],
 0.9321698732735655)

In [42]:
print ('Best Param (MaxIter): ', cvModel.bestModel._java_obj.getMaxIter())

Best Param (MaxIter):  9


In [43]:
print ('Best Param (RegParam): ', cvModel.bestModel._java_obj.getRegParam())


Best Param (RegParam):  0.01


In [44]:
print ('Best Param (ElasticNetParam): ', cvModel.bestModel._java_obj.getElasticNetParam())



Best Param (ElasticNetParam):  0.09


**Feature extraction and PCA (dimensionality reduction)**

PCA is a technique that reduces the number of dimensions (variables) in a dataset while trying to preserve as much information as possible. It is extremely useful in exploratory data analysis when the data has so many dimensions (variables) that it cannot be analyzed properly.
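
To illustrate the idea, here is a minimal plain-Python sketch of PCA restricted to two features and one component, using the GRE and TOEFL scores of the first five rows shown earlier. It is not Spark's implementation: the function names are illustrative, and the closed-form angle used here only applies to the 2x2 case.

```python
import math

def first_principal_component(points):
    """Return (mean, unit direction) of the first principal component of 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Angle of the eigenvector with the largest eigenvalue (closed form for a 2x2 matrix)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))

def project(points, mean, direction):
    """Project the centered points onto the principal direction (2 features -> 1)."""
    return [(p[0] - mean[0]) * direction[0] + (p[1] - mean[1]) * direction[1]
            for p in points]

# (GRE Score, TOEFL Score) of the first five rows
scores = [(337.0, 118.0), (324.0, 107.0), (316.0, 104.0), (322.0, 110.0), (314.0, 103.0)]
mean, direction = first_principal_component(scores)
reduced = project(scores, mean, direction)

print("Principal direction:", direction)
print("One-dimensional representation:", reduced)
```

Each student is now described by a single number along the direction of greatest variance instead of two scores, which is the same trade-off `PCA(k=2, ...)` makes when it compresses the full feature vector into two components.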

To select the variables most relevant for prediction, we use the Chi-Squared selector.

In our case, we extract the two variables that carry the most weight in terms of useful information.

In [46]:
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

selector = ChiSqSelector(numTopFeatures=2, featuresCol="features",
                             outputCol="selectedFeatures", labelCol="label")

result = selector.fit(data).transform(data)

print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())
result.show()

ChiSqSelector output with top 2 features selected
+--------------------+-----+--------------------+----------------+
|            features|label|        pca_features|selectedFeatures|
+--------------------+-----+--------------------+----------------+
|[1.0,337.0,118.0,...| 0.92|[3.24875991995343...|   [337.0,118.0]|
|[2.0,324.0,107.0,...| 0.76|[2.03427489018224...|   [324.0,107.0]|
|[3.0,316.0,104.0,...| 0.72|[0.92834004552377...|   [316.0,104.0]|
|[4.0,322.0,110.0,...|  0.8|[0.03445640395570...|   [322.0,110.0]|
|[5.0,314.0,103.0,...| 0.65|[-1.1026673565109...|   [314.0,103.0]|
|[6.0,330.0,115.0,...|  0.9|[-1.8424081105850...|   [330.0,115.0]|
|[7.0,321.0,109.0,...| 0.75|[-2.9827444301870...|   [321.0,109.0]|
|[8.0,308.0,101.0,...| 0.68|[-4.1748594605740...|   [308.0,101.0]|
|[9.0,302.0,102.0,...|  0.5|[-5.2306601034426...|   [302.0,102.0]|
|[10.0,323.0,108.0...| 0.45|[-5.9708511040333...|   [323.0,108.0]|
|[11.0,325.0,106.0...| 0.52|[-6.9658918293670...|   [325.0,106.0]|
|[12.0,327.0

We find that the GRE and TOEFL scores are the two most important features, in that order.
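
As a rough illustration of the statistic behind a chi-squared selector, the sketch below computes Pearson's chi-squared statistic between one categorical feature and a categorical label in plain Python. The data and the function name are made up for the example. Note that the test treats both the feature and the label as categories, so with a continuous label such as the admission chance every distinct value counts as its own category.

```python
from collections import Counter

def chi_squared_statistic(feature_values, labels):
    """Pearson chi-squared statistic of the feature/label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            # Expected count if feature and label were independent
            expected = f_total * l_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected
    return statistic

# Hypothetical binary data: a research flag and an admitted/not-admitted label
research = [0, 0, 1, 1]
admitted_dependent = [0, 0, 1, 1]    # label fully determined by the feature
admitted_independent = [0, 1, 0, 1]  # label unrelated to the feature

print(chi_squared_statistic(research, admitted_dependent))    # 4.0
print(chi_squared_statistic(research, admitted_independent))  # 0.0
```

A selector of this kind ranks features by such a test and keeps the top ones, so a larger statistic (stronger dependence on the label) means a higher rank.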