# Logistic Regression

Let's see an example of how to run a logistic regression with Python and Spark! This is the documentation example, so we will run through it quickly and then show a more realistic example. Afterwards, you will have another consulting project!

In [10]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('logregdoc').getOrCreate()

In [11]:
from pyspark.ml.classification import LogisticRegression

In [12]:
# Load training data
training = spark.read.format("libsvm").load("sample_libsvm_data.txt")

lr = LogisticRegression()  # all parameters left at their defaults

# Fit the model
lrModel = lr.fit(training)

trainingSummary = lrModel.summary

23/08/15 20:10:07 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
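The warning above goes away if the reader is told the feature count up front. Here is a minimal sketch, assuming the count is 692 (the vector size shown in the `features` column for this dataset). `load_libsvm` is a hypothetical helper name; `numFeatures` is the option named in the warning itself.

```python
def load_libsvm(spark, path, num_features):
    """Load a LIBSVM file with an explicit feature count (hypothetical helper)."""
    return (
        spark.read.format("libsvm")
        # Passing numFeatures lets Spark skip the extra scan mentioned in the warning.
        .option("numFeatures", str(num_features))
        .load(path)
    )

# Usage, assuming the `spark` session created earlier and 692 features:
# training = load_libsvm(spark, "sample_libsvm_data.txt", 692)
```

If the number of features is not known in advance, the original call works fine and only costs one extra pass over the file.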


In [13]:
trainingSummary.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[127,128,129...|[20.3777627514875...|[0.99999999858729...|       0.0|
|  1.0|(692,[158,159,160...|[-21.114014198852...|[6.76550380011201...|       1.0|
|  1.0|(692,[124,125,126...|[-23.743613234684...|[4.87842678711891...|       1.0|
|  1.0|(692,[152,153,154...|[-19.192574012724...|[4.62137287296030...|       1.0|
|  1.0|(692,[151,152,153...|[-20.125398874706...|[1.81823629111716...|       1.0|
|  0.0|(692,[129,130,131...|[20.4890549504206...|[0.99999999873608...|       0.0|
|  1.0|(692,[158,159,160...|[-21.082940212796...|[6.97903542836027...|       1.0|
|  1.0|(692,[99,100,101,...|[-19.622713503561...|[3.00582577442810...|       1.0|
|  0.0|(692,[154,155,156...|[21.1594863606525...|[0.99999999935352...|       0.0|
|  0.0|(692,[127
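To read the table above: for binary logistic regression, each entry of `probability` should be the logistic (sigmoid) function applied to the matching entry of `rawPrediction`. The sketch below checks this in plain Python for the first two rows. The input values are copied from the truncated display, so the results only match approximately; the variable names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# First entries of rawPrediction for the first two rows shown above.
# The display truncates these values, so treat them as approximations.
raw_class0_row1 = 20.3777627514875
raw_class0_row2 = -21.114014198852

p_class0_row1 = sigmoid(raw_class0_row1)  # very close to 1, so class 0 is predicted
p_class0_row2 = sigmoid(raw_class0_row2)  # very close to 0, so class 1 is predicted

print(p_class0_row1, p_class0_row2)
```

A large positive raw value for class 0 means the model is confident in label 0.0, and a large negative one means it is confident in label 1.0, which agrees with the `prediction` column.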

In [14]:
# From the older RDD-based pyspark.mllib API, which may change soon! Not used in the cells below.
from pyspark.mllib.evaluation import MulticlassMetrics

In [15]:
lrModel.evaluate(training)

<pyspark.ml.classification.BinaryLogisticRegressionSummary at 0x7f43cc8ab0d0>

In [16]:
# Usually would do this on a separate test set!
predictionAndLabels = lrModel.evaluate(training)
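As the comment says, evaluation normally happens on held-out data. Here is a minimal sketch of that workflow, assuming a 70/30 split and a fixed seed, both arbitrary choices. `evaluate_on_holdout` is a hypothetical helper; `randomSplit`, `fit`, and `evaluate` are the DataFrame, estimator, and model methods.

```python
def evaluate_on_holdout(estimator, data, train_fraction=0.7, seed=42):
    """Fit on one part of the data and evaluate on the rest (hypothetical helper)."""
    # randomSplit returns one DataFrame per weight; the weights here sum to 1.
    train_df, test_df = data.randomSplit(
        [train_fraction, 1.0 - train_fraction], seed=seed
    )
    model = estimator.fit(train_df)
    # For a logistic regression model this returns a summary object
    # whose .predictions attribute holds the scored test rows.
    return model.evaluate(test_df)

# Usage, assuming `lr` and `training` from the cells above:
# test_summary = evaluate_on_holdout(lr, training)
# test_summary.predictions.show()
```

Scores measured on the training data itself, as in this walkthrough, tend to be optimistic.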

In [17]:
predictionAndLabels.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[127,128,129...|[20.3777627514875...|[0.99999999858729...|       0.0|
|  1.0|(692,[158,159,160...|[-21.114014198852...|[6.76550380011201...|       1.0|
|  1.0|(692,[124,125,126...|[-23.743613234684...|[4.87842678711891...|       1.0|
|  1.0|(692,[152,153,154...|[-19.192574012724...|[4.62137287296030...|       1.0|
|  1.0|(692,[151,152,153...|[-20.125398874706...|[1.81823629111716...|       1.0|
|  0.0|(692,[129,130,131...|[20.4890549504206...|[0.99999999873608...|       0.0|
|  1.0|(692,[158,159,160...|[-21.082940212796...|[6.97903542836027...|       1.0|
|  1.0|(692,[99,100,101,...|[-19.622713503561...|[3.00582577442810...|       1.0|
|  0.0|(692,[154,155,156...|[21.1594863606525...|[0.99999999935352...|       0.0|
|  0.0|(692,[127

In [18]:
predictionAndLabels = predictionAndLabels.predictions.select('label','prediction')

In [19]:
predictionAndLabels.show()

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  0.0|       0.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  1.0|       1.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  1.0|       1.0|
|  0.0|       0.0|
|  1.0|       1.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  1.0|       1.0|
|  1.0|       1.0|
+-----+----------+
only showing top 20 rows



## Evaluators

Evaluators will be a very important part of our pipeline when working with machine learning. Let's see some basics for logistic regression. Useful links:

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator

In [20]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator

In [21]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label')
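By default, `BinaryClassificationEvaluator` reports the area under the ROC curve (`metricName='areaUnderROC'`). To make that number concrete, here is a plain-Python sketch of one way to compute it: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. This illustrates the metric and is not Spark's implementation. `auc_pairwise` is a hypothetical name, and the sample rows are the first six label/prediction pairs displayed earlier.

```python
def auc_pairwise(labels, scores):
    """Area under the ROC curve via pairwise comparison (illustrative only)."""
    pos = [s for y, s in zip(labels, scores) if y == 1.0]
    neg = [s for y, s in zip(labels, scores) if y == 0.0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))

# First six rows shown earlier, using the hard prediction as the score.
labels = [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
scores = [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
print(auc_pairwise(labels, scores))  # 1.0
```

Using the hard `prediction` column as the score, as the cell above does, works but discards ranking information. The `rawPrediction` column usually gives a more informative curve.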

In [22]:
# For multiclass metrics such as accuracy (this replaces the binary evaluator above)
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label',
                                             metricName='accuracy')

In [23]:
acc = evaluator.evaluate(predictionAndLabels)

In [24]:
acc

1.0
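The `accuracy` metric is simply the fraction of rows whose prediction equals the label. The sketch below recomputes it in plain Python for the 20 label/prediction rows displayed earlier; it covers only those rows, not the full dataset, and the `accuracy` function is illustrative rather than Spark code.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that exactly match their labels."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# The 20 rows shown by predictionAndLabels.show() above.
labels = [0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0,
          1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
predictions = [0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0,
               1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0]

print(accuracy(labels, predictions))  # 1.0
```

A perfect score is less impressive than it looks here, since the model is being scored on the same data it was trained on.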

Okay, let's move on and see some more examples!