# Credit Card Default
#### Josep Puig Sallés

I try to predict which credit card clients will default. I use the **Credit Card DataSet** with **pyspark**, train several machine learning models, and show how each one behaves in sample (on the training set) and out of sample (on the test set).

Some of these techniques are not the natural choice for this kind of dataset; I only want to try out several classification models.

This is my first post on GitHub, and I have only been learning Spark for the last two months, so if you see a mistake or anything that could be done better, please leave a comment.

This post has two parts. In the first, I use *Spark SQL* for some exploratory analysis. In the second, I train several machine learning models to predict which clients default.

One limitation of working in Spark is that I cannot plot the results directly, as I would in R or Python.

The first step is to set up the Spark environment:


In [1]:
import os
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python3'
import pyspark
conf = pyspark.SparkConf()
sc = pyspark.SparkContext(conf=conf)

The second step is to load the data and drop the header row:

In [2]:
d0 = sc.textFile('./UCI_Credit_Card.csv')
header = d0.first()
d1 = d0.filter(lambda line: line != header)
d2 = d1.map(lambda x: x.split(','))
d2.take(7)

[['1',
  '20000',
  '2',
  '2',
  '1',
  '24',
  '2',
  '2',
  '-1',
  '-1',
  '-2',
  '-2',
  '3913',
  '3102',
  '689',
  '0',
  '0',
  '0',
  '0',
  '689',
  '0',
  '0',
  '0',
  '0',
  '1'],
 ['2',
  '120000',
  '2',
  '2',
  '2',
  '26',
  '-1',
  '2',
  '0',
  '0',
  '0',
  '2',
  '2682',
  '1725',
  '2682',
  '3272',
  '3455',
  '3261',
  '0',
  '1000',
  '1000',
  '1000',
  '0',
  '2000',
  '1'],
 ['3',
  '90000',
  '2',
  '2',
  '2',
  '34',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '29239',
  '14027',
  '13559',
  '14331',
  '14948',
  '15549',
  '1518',
  '1500',
  '1000',
  '1000',
  '1000',
  '5000',
  '0'],
 ['4',
  '50000',
  '2',
  '2',
  '1',
  '37',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '46990',
  '48233',
  '49291',
  '28314',
  '28959',
  '29547',
  '2000',
  '2019',
  '1200',
  '1100',
  '1069',
  '1000',
  '0'],
 ['5',
  '50000',
  '1',
  '2',
  '1',
  '57',
  '-1',
  '0',
  '-1',
  '0',
  '0',
  '0',
  '8617',
  '5670',
  '35835',
  '20940',
  '19146',
  

I define some helper functions that decode the categorical columns for exploration: Gender, Education, Marriage and Default:

In [3]:
def Gender(x):
    if int(x) == 2:
        return 'Female'
    else: return 'Male'
def Education(x):
    if int(x) == 1:
        return 'GraduateSchool'
    elif int(x) ==2:
        return 'University'
    elif int(x) == 3:
        return 'HighSchool'
    else: return 'Unknown'

def Marriage(x):
    if int(x) == 1:
        return 'Married'
    elif int(x) ==2:
        return 'Single'
    else: return 'Others'

def Default(x):
    if int(x)==1:
        return 'Yes'
    else: return 'No'
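
The if/elif chains above can also be written as dictionary lookups with a default value. The sketch below is an alternative formulation, not the code used in this post; the names `EDUCATION_LABELS`, `MARRIAGE_LABELS` and `decode` are illustrative only.

```python
# Hypothetical alternative to the if/elif helpers: dictionary lookups with a fallback.
EDUCATION_LABELS = {1: 'GraduateSchool', 2: 'University', 3: 'HighSchool'}
MARRIAGE_LABELS = {1: 'Married', 2: 'Single'}

def decode(raw, labels, default):
    """Map a raw CSV field (a string such as '2') to its label, or to `default`."""
    return labels.get(int(raw), default)

print(decode('2', EDUCATION_LABELS, 'Unknown'))
print(decode('5', EDUCATION_LABELS, 'Unknown'))
print(decode('3', MARRIAGE_LABELS, 'Others'))
```

Adding a new category then only requires a new dictionary entry rather than another branch.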

In [4]:
from pyspark.sql import Row

datosExp = d2.map(lambda x: Row(
        Default = Default(x[24]),
        Limit_Bal = float(x[1]),
        sex = Gender(x[2]),
        Education = Education(x[3]),
        Marriage = Marriage(x[4]),
        Age = int(x[5]),
        Pay_0 = float(x[6]),
        Pay_2 = float(x[7]),
        Pay_3 = float(x[8]),
        Pay_4 = float(x[9]),
        Pay_5 = float(x[10]),
        Pay_6 = float(x[11]),
        Bill_amt1 = float(x[12]),
        Bill_amt2 = float(x[13]),
        Bill_amt3 = float(x[14]),
        Bill_amt4 = float(x[15]),
        Bill_amt5 = float(x[16]),
        Bill_amt6 = float(x[17]),
        Pay_amt1 = float(x[18]),
        Pay_amt2 = float(x[19]),
        Pay_amt3 = float(x[20]),
        Pay_amt4 = float(x[21]),
        Pay_amt5 = float(x[22]),
        Pay_amt6 = float(x[23])
        
    ))

We can do some exploration:

In [5]:
import numpy as np
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
datos_df = sqlContext.createDataFrame(datosExp)

In [6]:
datos_df.select("sex", "Default").groupBy("Default", 'sex').count().show()

+-------+------+-----+
|Default|   sex|count|
+-------+------+-----+
|     No|Female|14349|
|     No|  Male| 9015|
|    Yes|  Male| 2873|
|    Yes|Female| 3763|
+-------+------+-----+
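
Raw counts are hard to compare because the data contains more women than men. As a small worked example, the default rate per group can be computed from the counts printed above (plain Python, outside Spark; the variable names are illustrative).

```python
# Counts copied from the groupBy output above.
counts = {
    'Female': {'Yes': 3763, 'No': 14349},
    'Male': {'Yes': 2873, 'No': 9015},
}

# Default rate = defaulters / all clients in the group.
rates = {sex: c['Yes'] / (c['Yes'] + c['No']) for sex, c in counts.items()}

for sex, rate in rates.items():
    print(f'{sex}: {rate:.1%} default')
```

On these counts, men default somewhat more often (about 24.2%) than women (about 20.8%).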



In [7]:
datos_df.select("Marriage", "Default").groupBy("Default", 'Marriage').count().show()

+-------+--------+-----+
|Default|Marriage|count|
+-------+--------+-----+
|     No|  Others|  288|
|    Yes|  Others|   89|
|     No|  Single|12623|
|    Yes| Married| 3206|
|    Yes|  Single| 3341|
|     No| Married|10453|
+-------+--------+-----+



In [8]:
datos_df.select("Education", "Default").groupBy("Default", 'Education').count().show()

+-------+--------------+-----+
|Default|     Education|count|
+-------+--------------+-----+
|    Yes|    HighSchool| 1237|
|     No|       Unknown|  435|
|     No|GraduateSchool| 8549|
|    Yes|GraduateSchool| 2036|
|    Yes|       Unknown|   33|
|     No|    University|10700|
|     No|    HighSchool| 3680|
|    Yes|    University| 3330|
+-------+--------------+-----+



This looks a lot like SQL. In fact, we can write SQL queries directly:

In [9]:
datos_df.registerTempTable("datos")

In [10]:
import pandas as pd
pd.DataFrame(sqlContext.sql("""
        SELECT Education, Count(*) FROM datos GROUP BY Education
                    """).collect())

                0      1
0      HighSchool   4917
1         Unknown    468
2  GraduateSchool  10585
3      University  14030


# Models

As I said, I will train some machine learning models. First, I encode every categorical variable as 0/1 indicator variables.

In [11]:
def Gender(x):
    if int(x) == 2:
        return 1
    else: return 0

# Variables for Education
    
def GraduateSchool(x):
    if int(x) == 1:
        return 1
    else: return 0
    
def University(x):
    if int(x) == 2:
        return 1
    else: return 0

def HighSchool(x):
    if int(x) == 3:
        return 1
    else: return 0
    
def Unknow(x):
    if int(x) == 4 or int(x) == 5 or int(x) == 6:
        return 1
    else: return 0

    
#Marriage variables

def Married(x):
    if int(x) ==1:
        return 1
    else: return 0
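
The indicator functions above all follow the same pattern, so they could be generated from a list of category codes. This is only a sketch of the idea: `one_hot` is a hypothetical helper, and unlike `Unknow` above it maps every code outside the list to an all-zero vector instead of a separate indicator.

```python
def one_hot(raw, categories):
    """Return one 0.0/1.0 indicator per category for a raw CSV field."""
    value = int(raw)
    return [1.0 if value == category else 0.0 for category in categories]

# Education codes 1, 2, 3 correspond to GraduateSchool, University, HighSchool.
print(one_hot('2', (1, 2, 3)))
# A code outside the list yields all zeros.
print(one_hot('5', (1, 2, 3)))
```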

In [12]:
from pyspark.sql import Row
labels = ['Limit_Bal', 'Sex', 'Graduate_School','University','High_School','SchoolUnknown', 'Married',
          'Age','Pay0', 'Pay2', 'Pay3', 'Pay4', 'Pay5', 'Pay6', 'Bill_amt1', 'Bill_amt2', 'Bill_amt3','Bill_amt4',
         'Bill_amt5','Bill_amt6', 'Pay_amt1','Pay_amt2','Pay_amt3','Pay_amt4','Pay_amt5','Pay_amt6']
datosInicio = d2.map(lambda x: Row(
        AAADefault = float(x[24]),
        Limit_Bal = float(x[1]),
        sex = float(Gender(x[2])),
        GraduateSchool = float(GraduateSchool(x[3])),
        University = float(University(x[3])),
        HighSchool = float(HighSchool(x[3])),
        SchoolUnknown = float(Unknow(x[3])),
        Married = float(Married(x[4])),
        Age = float(x[5]),
        Pay_0 = float(x[6]),
        Pay_2 = float(x[7]),
        Pay_3 = float(x[8]),
        Pay_4 = float(x[9]),
        Pay_5 = float(x[10]),
        Pay_6 = float(x[11]),
        Bill_amt1 = float(x[12]),
        Bill_amt2 = float(x[13]),
        Bill_amt3 = float(x[14]),
        Bill_amt4 = float(x[15]),
        Bill_amt5 = float(x[16]),
        Bill_amt6 = float(x[17]),
        Pay_amt1 = float(x[18]),
        Pay_amt2 = float(x[19]),
        Pay_amt3 = float(x[20]),
        Pay_amt4 = float(x[21]),
        Pay_amt5 = float(x[22]),
        Pay_amt6 = float(x[23])        
    ))
datosInicio.take(2)

[Row(AAADefault=1.0, Age=24.0, Bill_amt1=3913.0, Bill_amt2=3102.0, Bill_amt3=689.0, Bill_amt4=0.0, Bill_amt5=0.0, Bill_amt6=0.0, GraduateSchool=0.0, HighSchool=0.0, Limit_Bal=20000.0, Married=1.0, Pay_0=2.0, Pay_2=2.0, Pay_3=-1.0, Pay_4=-1.0, Pay_5=-2.0, Pay_6=-2.0, Pay_amt1=0.0, Pay_amt2=689.0, Pay_amt3=0.0, Pay_amt4=0.0, Pay_amt5=0.0, Pay_amt6=0.0, SchoolUnknown=0.0, University=1.0, sex=1.0),
 Row(AAADefault=1.0, Age=26.0, Bill_amt1=2682.0, Bill_amt2=1725.0, Bill_amt3=2682.0, Bill_amt4=3272.0, Bill_amt5=3455.0, Bill_amt6=3261.0, GraduateSchool=0.0, HighSchool=0.0, Limit_Bal=120000.0, Married=0.0, Pay_0=-1.0, Pay_2=2.0, Pay_3=0.0, Pay_4=0.0, Pay_5=0.0, Pay_6=2.0, Pay_amt1=0.0, Pay_amt2=1000.0, Pay_amt3=1000.0, Pay_amt4=1000.0, Pay_amt5=0.0, Pay_amt6=2000.0, SchoolUnknown=0.0, University=1.0, sex=1.0)]

In [13]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
    
datos = spark.createDataFrame(datosInicio)

In [14]:
from pyspark.mllib.regression import LabeledPoint
from numpy import array
# Used for logistic regression with L-BFGS and for the decision tree
datosMod = datosInicio.map(lambda x: LabeledPoint(x[0], array([x[1::]])))
datosMod.take(2)

[LabeledPoint(1.0, [24.0,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,0.0,20000.0,1.0,2.0,2.0,-1.0,-1.0,-2.0,-2.0,0.0,689.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0]),
 LabeledPoint(1.0, [26.0,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,0.0,120000.0,0.0,-1.0,2.0,0.0,0.0,0.0,2.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,0.0,1.0,1.0])]

In [15]:
# Split the data into train and test sets
datosMod_train, datosMod_test = datosMod.randomSplit([0.8,0.2],1234)

In [16]:
print('Num observations total:', datosMod.count())
print('Num observations train:', datosMod_train.count())
print('Num observations test:', datosMod_test.count())
print('test + train', datosMod_train.count() + datosMod_test.count() )

Num observations total: 30000
Num observations train: 24071
Num observations test: 5929
test + train 30000
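
Note that the split is not exactly 80/20 (24071 rows instead of 24000). As far as I understand, `randomSplit` assigns each row independently at random, using the weights as probabilities, so the sizes only approximate the requested proportions. The plain-Python sketch below imitates that behaviour; it is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train with probability `train_weight`, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

train, test = random_split(range(30000), 0.8, seed=1234)
# Sizes are close to 24000 / 6000 but usually not exact; they always sum to 30000.
print(len(train), len(test), len(train) + len(test))
```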


## Logistic Regression

In [17]:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
datIni = d2.map(lambda x: Row(
        label = float(x[24]),
        features =  Vectors.dense(float(x[1]),
        float(Gender(x[2])),
        float(GraduateSchool(x[3])),
        float(University(x[3])),
        float(HighSchool(x[3])),
        float(Unknow(x[3])),
        float(Married(x[4])),
        float(x[5]),
        float(x[6]),
        float(x[7]),
        float(x[8]),
        float(x[9]),
        float(x[10]),
        float(x[11]),
        float(x[12]),
        float(x[13]),
        float(x[14]),
        float(x[15]),
        float(x[16]),
        float(x[17]),
        float(x[18]),
        float(x[19]),
        float(x[20]),
        float(x[21]),
        float(x[22]),
        float(x[23])        
    )))
sdf = datIni.toDF()
LR_train, LR_test = sdf.randomSplit([0.8,0.2],1234)
LR_train.take(1)

[Row(features=DenseVector([10000.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 43.0, -1.0, 0.0, 0.0, 0.0, -2.0, -2.0, 17560.0, 9829.0, 3604.0, 0.0, 0.0, 0.0, 2537.0, 1000.0, 0.0, 0.0, 0.0, 0.0]), label=1.0)]

In [18]:
from pyspark.ml.classification import LogisticRegression
from time import time
t0 = time()
blor = LogisticRegression(maxIter=5)
blorModel = blor.fit(LR_train)
t1 = time() - t0
print(round(t1,3),'Seconds')

99.291 Seconds


In [19]:
print('1.  Coefficients:\n  ', blorModel.coefficients)
print('\n2.  Intercept:', blorModel.intercept)


1.  Coefficients:
   [-8.74004462692e-07,-0.10126018842,0.110234293833,0.00919486676759,-0.035717929177,-0.664689697551,0.152913729597,0.00730709213761,0.424251048269,0.146642792394,0.0810527408026,0.0331709378994,0.0228535873652,0.0127520633367,-7.60708531435e-07,-5.24216388531e-07,-5.50630020876e-07,-3.59203153813e-07,-2.08453540459e-07,-9.60976203915e-08,-6.67696430106e-06,-3.95024694158e-06,-3.08198844462e-06,-3.74317944484e-06,-3.66555457151e-06,-3.08175906809e-06]

2.  Intercept: -1.2508486105911831
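
To read these numbers: logistic regression turns the linear score (intercept plus the coefficient-weighted features) into a probability with the logistic (sigmoid) function, and each coefficient is a change in log-odds per unit of its feature. A minimal sketch with two values copied from the output above (the function name `sigmoid` and the variable names are mine):

```python
import math

def sigmoid(score):
    """Logistic function: maps a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

intercept = -1.2508486105911831   # from the output above
pay0_coef = 0.424251048269        # 9th entry of the coefficient vector (Pay0)

# Predicted probability of default when every feature is zero.
baseline = sigmoid(intercept)
# Multiplicative change in the odds of default per one-unit increase in Pay0.
odds_ratio = math.exp(pay0_coef)

print(round(baseline, 3), round(odds_ratio, 3))
```

This gives a baseline probability of about 0.22 and an odds ratio of about 1.53, so each one-unit increase in Pay0 multiplies the estimated odds of default by roughly 1.5 with the other features held fixed.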


In [20]:
import pandas as pd
labels = ['Limit_Bal', 'Sex', 'Graduate_School','University','High_School','SchoolUnknown', 'Married',
          'Age','Pay0', 'Pay2', 'Pay3', 'Pay4', 'Pay5', 'Pay6', 'Bill_amt1', 'Bill_amt2', 'Bill_amt3','Bill_amt4',
         'Bill_amt5','Bill_amt6', 'Pay_amt1','Pay_amt2','Pay_amt3','Pay_amt4','Pay_amt5','Pay_amt6', 'intercept']
coef = []
for i in blorModel.coefficients:
    coef.append(i)
coef.append(blorModel.intercept)
stats = pd.DataFrame(coef, index=labels, columns=['Coefficients']).round(8)
# areaUnderROC from the model summary is computed on the training data
print('Area Under the ROC curve:',blorModel.summary.areaUnderROC)
stats

Area Under the ROC curve: 0.7229381001197289


                 Coefficients
Limit_Bal            -8.7e-07
Sex                -0.1012602
Graduate_School     0.1102343
University         0.00919487
High_School       -0.03571793
SchoolUnknown      -0.6646897
Married             0.1529137
Age                0.00730709
Pay0                0.4242511
Pay2                0.1466428


In [21]:
pred = blorModel.transform(LR_train)
predicciones = pred.select('label','prediction')
Truepredicciones = predicciones.filter(predicciones.label == predicciones.prediction).count()
print('Accuracy with TRAIN:')
print('   Total sum of well predicted:', Truepredicciones)
print('   % of well predicted', Truepredicciones/predicciones.count()*100,'%. ' )

Accuracy with TRAIN:
   Total sum of well predicted: 19218
   % of well predicted 80.36632793877807 %. 


In [22]:
predTest = blorModel.transform(LR_test)
prediccionesTest = predTest.select('label','prediction')
TrueprediccionesTest = prediccionesTest.filter(prediccionesTest.label == prediccionesTest.prediction).count()
print('Accuracy with TEST:')
print('   Total sum of well predicted:', TrueprediccionesTest)
print('   % of well predicted', TrueprediccionesTest/prediccionesTest.count()*100,'%. ' )

Accuracy with TEST:
   Total sum of well predicted: 4873
   % of well predicted 80.05585674388041 %. 
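
Accuracy alone can flatter a model on unbalanced data. From the exploratory counts, 23364 of the 30000 clients (14349 + 9015) did not default, so a model that always predicts "no default" would already be about 77.9% accurate. The sketch below, in plain Python with hypothetical names, computes accuracy together with recall for the default class to show the difference on toy data.

```python
def accuracy_and_recall(pairs):
    """pairs: iterable of (label, prediction), with 1.0 meaning default."""
    pairs = list(pairs)
    correct = sum(1 for label, pred in pairs if label == pred)
    positives = [pred for label, pred in pairs if label == 1.0]
    caught = sum(1 for pred in positives if pred == 1.0)
    recall = caught / len(positives) if positives else 0.0
    return correct / len(pairs), recall

# Majority-class baseline on the full dataset, using the counts shown earlier.
baseline_accuracy = 23364 / 30000
print(round(baseline_accuracy, 4))

# Toy data: 8 non-defaulters, 2 defaulters, and a model that always predicts 0.0.
toy = [(0.0, 0.0)] * 8 + [(1.0, 0.0)] * 2
print(accuracy_and_recall(toy))
```

The toy model reaches 80% accuracy while finding none of the defaulters, which is why a measure such as recall or the area under the ROC curve is worth reporting next to accuracy.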


### Logistic Regression with LBFGS

In [23]:
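# LabeledPoint RDD (label + features) for training
datosMod = datosInicio.map(lambda x: LabeledPoint(x[0], array([x[1::]])))
# Features only, for prediction
datosPred = datosInicio.map(lambda x: x[1::])
# Both RDDs are split with the same weights and seed so that their rows line up when zipped later
datosPred_train, datosPred_test = datosPred.randomSplit([0.8,0.2],1234)
datosMod_train, datosMod_test = datosMod.randomSplit([0.8,0.2],1234)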

In [24]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

In [25]:
from time import time
t0 = time()
logit_model = LogisticRegressionWithLBFGS.train(datosMod_train)
t1 = time() - t0
print(round(t1,3),'Seconds')

80.042 Seconds


In [26]:
# Weights
logit_model.weights

DenseVector([0.0042, -0.0, 0.0, 0.0, 0.0, -0.0, 0.0, -1.0753, -1.2225, -0.0, 0.1859, 0.5875, 0.0881, 0.0567, 0.0245, 0.0157, 0.0216, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -2.4474, -1.2054, -0.1176])

In [27]:
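# Without parentheses this only displays the bound method (and the model's weights); no threshold is set
logit_model.setThreshold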

<bound method LinearClassificationModel.setThreshold of (weights=[0.00419717371539,-6.70647031126e-06,2.92195157448e-06,1.85245123648e-06,8.48434659793e-07,-1.10559421432e-06,1.47600726139e-06,-1.07534939407,-1.22247101208,-8.12742676273e-07,0.185913882623,0.587501004966,0.0881274981419,0.0566994973616,0.0244827342975,0.0156947491265,0.0216215913302,-1.46151573395e-05,-1.19903117558e-05,-2.72322153108e-06,-2.346285812e-06,-5.55896968225e-06,-3.388310023e-06,-2.44735322471,-1.20544762595,-0.117595200924], intercept=0.0)>

In [28]:
#Train sample
pred = logit_model.predict(datosPred_train)
cont = datosMod_train.map(lambda x: x.label).zip(pred)
countTrue = cont.filter(lambda v: v[0]==v[1]).count()
print('Accuracy with TRAIN:')
print('   Total sum of well predicted:', countTrue)
print('   % of well predicted', countTrue/cont.count()*100,'%. ' )

Accuracy with TRAIN:
   Total sum of well predicted: 19558
   % of well predicted 81.2512982426987 %. 


In [29]:
#Test sample
predTest = logit_model.predict(datosPred_test)
contTest = datosMod_test.map(lambda x: x.label).zip(predTest)
countTrueTest = contTest.filter(lambda v: v[0]==v[1]).count()
print('Accuracy with TEST:')
print('   Total sum of well predicted:', countTrueTest)
print('   % of well predicted', countTrueTest/contTest.count()*100,'%. ' )


Accuracy with TEST:
   Total sum of well predicted: 4800
   % of well predicted 80.95800303592512 %. 


## Decision Tree

In [30]:
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
t0 = time()
tree_model = DecisionTree.trainClassifier(datosMod_train, numClasses=2, categoricalFeaturesInfo={},
                                         maxDepth=7)
t1 = time() - t0
print(round(t1,3),'Seconds')
print(tree_model)

52.053 Seconds
DecisionTreeModel classifier of depth 7 with 213 nodes


In [31]:
pred = tree_model.predict(datosMod_train.map(lambda x: x.features))
comp = datosMod_train.map(lambda x: x.label).zip(pred) # zip pairs each label with its prediction
PredicTrue = comp.filter(lambda x: x[0]==x[1]).count()
print('Accuracy with TRAIN:')
print('   Total sum of well predicted:', PredicTrue)
print('   % of well predicted', PredicTrue/pred.count()*100,'%.' )

Accuracy with TRAIN:
   Total sum of well predicted: 19972
   % of well predicted 82.971210169914 %.


In [32]:
predTest = tree_model.predict(datosMod_test.map(lambda x: x.features))
compTest = datosMod_test.map(lambda x: x.label).zip(predTest) # zip pairs each label with its prediction
PredicTrueTest = compTest.filter(lambda x: x[0]==x[1]).count()
print('Accuracy with TEST:')
print('   Total sum of well predicted:', PredicTrueTest)
print('   % of well predicted', PredicTrueTest/predTest.count()*100,'%.' )

Accuracy with TEST:
   Total sum of well predicted: 4825
   % of well predicted 81.37965930173723 %.
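
For context on how such a tree chooses its splits: as far as I know, `DecisionTree.trainClassifier` uses Gini impurity by default and prefers the splits that reduce it the most. A small sketch of the measure itself (the helper `gini` is hypothetical; the counts are the no-default / default totals from the exploratory tables):

```python
def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Whole dataset: 23364 non-defaulters and 6636 defaulters.
print(round(gini([23364, 6636]), 4))
# A perfectly pure node has impurity 0.
print(gini([100, 0]))
```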


## Random Forest

In [33]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [34]:
t0 = time()
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(sdf)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(sdf)
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])
model = pipeline.fit(LR_train)
t1 = time() - t0
print(round(t1,3),'Seconds')

207.809 Seconds


In [35]:
predictions = model.transform(LR_train)
predR = predictions.select("predictedLabel", "label")
TruepredR = predR.filter(predR.predictedLabel == predR.label).count()
print('Accuracy with TRAIN:')
print('   Total sum of well predicted:', TruepredR)
print('   % of well predicted', TruepredR/predR.count()*100,'%.' )

Accuracy with TRAIN:
   Total sum of well predicted: 19698
   % of well predicted 82.37360431564422 %.


In [36]:
predictionsTest = model.transform(LR_test)
predRTest = predictionsTest.select("predictedLabel", "label")
TruepredRTest = predRTest.filter(predRTest.predictedLabel == predRTest.label).count()
print('Accuracy with TEST:')
print('   Total sum of well predicted:', TruepredRTest)
print('   % of well predicted', TruepredRTest/predRTest.count()*100,'%.' )

Accuracy with TEST:
   Total sum of well predicted: 4972
   % of well predicted 81.68227369804501 %.


## Gradient-boosted tree classifier

In [37]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [38]:
t0 = time()
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(sdf)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(sdf)
gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=20)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, gbt])
modelGB = pipeline.fit(LR_train)
t1 = time() - t0
print(round(t1,3),'Seconds')

196.444 Seconds


In [39]:
predictionsGB = modelGB.transform(LR_train)
predR = predictionsGB.select("prediction", "label")
TruepredR = predR.filter(predR.prediction == predR.label).count()
print('Accuracy with TRAIN:')
print('   Total sum of well predicted:', TruepredR)
print('   % of well predicted', TruepredR/predR.count()*100,'%.' )

Accuracy with TRAIN:
   Total sum of well predicted: 19877
   % of well predicted 83.12215113118387 %.


In [40]:
predictionsGBTest = modelGB.transform(LR_test)
predRTest = predictionsGBTest.select("prediction", "label")
TruepredRTest = predRTest.filter(predRTest.prediction == predRTest.label).count()
print('Accuracy with TEST:')
print('   Total sum of well predicted:', TruepredRTest)
print('   % of well predicted', TruepredRTest/predRTest.count()*100,'%.' )

Accuracy with TEST:
   Total sum of well predicted: 4982
   % of well predicted 81.84655823886973 %.


## Multilayer perceptron classifier

In [41]:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

t0 = time()
# Layer sizes: 26 input features, one hidden layer of 9 units, 5 output units
layers = [26, 9, 5]
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
modelMPC = trainer.fit(LR_train)
t1 = time()-t0
print(round(t1,3),'Seconds')

49.471 Seconds
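
Only two classes exist here, so an output layer of 2 units would be enough. As a rough illustration, the sketch below counts the trainable parameters of a fully connected network for both choices; `count_parameters` is a hypothetical helper, not part of pyspark.

```python
def count_parameters(layers):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

print(count_parameters([26, 9, 5]))  # layers used in this post
print(count_parameters([26, 9, 2]))  # two output units, one per class
```

The difference is small (293 versus 263 parameters), so the extra output units cost little, but they do not correspond to any class in the data.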


In [42]:
predictionsMPC = modelMPC.transform(LR_train)
predR = predictionsMPC.select("prediction", "label")
TruepredR = predR.filter(predR.prediction == predR.label).count()
print('Accuracy with TRAIN:')
print('   Total sum of well predicted:', TruepredR)
print('   % of well predicted', TruepredR/predR.count()*100,'%.' )

Accuracy with TRAIN:
   Total sum of well predicted: 18640
   % of well predicted 77.94923263496842 %.


In [43]:
predictionsMPCTest = modelMPC.transform(LR_test)
predRTest = predictionsMPCTest.select("prediction", "label")
TruepredRTest = predRTest.filter(predRTest.prediction == predRTest.label).count()
print('Accuracy with TEST:')
print('   Total sum of well predicted:', TruepredRTest)
print('   % of well predicted', TruepredRTest/predRTest.count()*100,'%.' )

Accuracy with TEST:
   Total sum of well predicted: 4724
   % of well predicted 77.60801708559225 %.


## SVM (Support Vector Machine)

In [44]:
from pyspark.mllib.classification import SVMWithSGD, SVMModel

In [45]:
t0 = time()
modelSVM = SVMWithSGD.train(datosMod_train, iterations=1500)
t1 = time()-t0
print(round(t1,3),'Seconds')

209.362 Seconds


In [46]:
predR.take(3)

[Row(prediction=0.0, label=1.0),
 Row(prediction=0.0, label=0.0),
 Row(prediction=0.0, label=0.0)]

In [47]:
predictionsSVM = modelSVM.predict(datosPred_train)
predR = datosMod_train.map(lambda x: x.label).zip(predictionsSVM)
TruepredR = predR.filter(lambda v: v[0]==v[1]).count()
print('Accuracy with TRAIN:')
print('   Total sum of well predicted:', TruepredR)
print('   % of well predicted', TruepredR/predR.count()*100,'%.' )

Accuracy with TRAIN:
   Total sum of well predicted: 5668
   % of well predicted 23.547006771633917 %.


In [48]:
predictionsSVMTest = modelSVM.predict(datosPred_test)
predRTest = datosMod_test.map(lambda x: x.label).zip(predictionsSVMTest)
TruepredRTest = predRTest.filter(lambda v: v[0]==v[1]).count()
print('Accuracy with TEST:')
print('   Total sum of well predicted:', TruepredRTest)
print('   % of well predicted', TruepredRTest/predRTest.count()*100,'%.' )

Accuracy with TEST:
   Total sum of well predicted: 1386
   % of well predicted 23.376623376623375 %.


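Clearly, SVM does not work well with this setup. An accuracy of about 23% is close to the share of defaulters in the data (6636 of 30000, about 22%), which suggests that the model predicts default for almost every client, so the result says more about this configuration than about SVMs in general.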
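
One likely contributor to this poor result is feature scale: the inputs mix 0/1 indicators with amounts in the tens of thousands, and gradient-descent methods such as `SVMWithSGD` tend to struggle with that. A common remedy, which I have not tried here, is to standardize each feature first. The sketch below shows the idea in plain Python on the `Limit_Bal` values of the first five rows; `standardize` is a hypothetical helper.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

limit_bal = [20000.0, 120000.0, 90000.0, 50000.0, 50000.0]  # first five rows
scaled = standardize(limit_bal)
print([round(v, 2) for v in scaled])
```

After this step every feature lives on a comparable scale, which usually makes gradient-based training better behaved.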