# Machine Learning & Spark


## Characteristics of Spark
Spark is one of the most popular technologies for processing large quantities of data. Not only can it handle enormous data volumes, it does so very efficiently. And, unlike some other distributed computing technologies, developing with Spark is a pleasure.

Which of these describe Spark?

1. Spark is a framework for cluster computing.

2. Spark does most processing in memory.

3. Spark has a high-level API, which conceals a lot of complexity.

4. <b>All of the above.</b>


## Components in a Spark Cluster
Spark is a distributed computing platform. It achieves efficiency by distributing data and computation across a cluster of computers.

A Spark cluster consists of a number of hardware and software components which work together.

Which of these is not part of a Spark cluster?

1. One or more nodes

2. A cluster manager

3. <b> A load balancer </b>

4. Executors


### Creating a SparkSession
In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a SparkSession object.

The `SparkSession` class has a `builder` attribute, which is an instance of the `Builder` class. The `Builder` class exposes three important methods that let you:

* specify the location of the master node;
* name the application (optional); and
* retrieve an existing `SparkSession` or, if there is none, create a new one.

The `SparkSession` class has a `version` attribute which gives the version of Spark.

* Import the `SparkSession` class from `pyspark.sql`.
* Create a `SparkSession` object connected to a local cluster. Use all available cores. Name the application `'test'`.
* Use the `SparkSession` object to retrieve the version of Spark running on the cluster. Note: The version might be different to the one that's used in the presentation (it gets updated from time to time).
* Shut down the cluster.

In [None]:
# Import the SparkSession class
from pyspark.sql import SparkSession

# Create SparkSession object
spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('test') \
                    .getOrCreate()

# What version of Spark?
# (Might be different to what you saw in the presentation!)
print(spark.version)

# Terminate the cluster
spark.stop()

In [1]:
# Start a fresh session for the remaining exercises (the previous one was stopped)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

## Loading Data

### Loading flights data
In this exercise you're going to load some airline flight data from a CSV file. To ensure that the exercise runs quickly these data have been trimmed down to only 50 000 records. You can get a larger dataset in the same format here.

Notes on CSV format:

* fields are separated by a comma (this is the default separator) and
* missing data are denoted by the string 'NA'.

Data dictionary:

* `mon` — month (integer between 1 and 12)
* `dom` — day of month (integer between 1 and 31)
* `dow` — day of week (integer; 1 = Monday and 7 = Sunday)
* `org` — origin airport (IATA code)
* `mile` — distance (miles)
* `carrier` — carrier (IATA code)
* `depart` — departure time (decimal hour)
* `duration` — expected duration (minutes)
* `delay` — delay (minutes)

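Note: The data have been aggressively down-sampled. The sample output below also contains `mon` and `dow` values of 0, so the file appears to use zero-based codes for these two columns despite the ranges listed above.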

In [13]:
# Read data from CSV file
flights = spark.read.csv('flights.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,                 
                         nullValue='NA'
                        )

# Get number of records
print("The data contain %d records." % flights.count())

# View the first five records
flights.show(5)

# Check column data types
print(flights.dtypes)

The data contain 50000 records.
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
|  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows

[('mon', 'int'), ('dom', 'int'), ('dow', 'int'), ('carrier', 'string'), ('flight', 'int'), ('org', 'string'), ('mile', 'int'), ('depart', 'double'), ('duration', 'int'), ('delay', 'int')]
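
The `nullValue='NA'` option tells the reader to treat the string `NA` as a missing value rather than as text, which is why `delay` can still be inferred as an integer column. A minimal plain-Python sketch of the same idea follows. The two data rows are modelled on the first two rows shown above, and the helper name `parse_delay` is an assumption made for illustration.

```python
import csv
import io

# Small in-memory sample in the same layout as the flights data (illustrative only).
sample = io.StringIO(
    "mon,carrier,delay\n"
    "11,US,NA\n"
    "0,UA,30\n"
)

def parse_delay(value):
    """Map the 'NA' marker to None, otherwise convert to int."""
    return None if value == "NA" else int(value)

rows = [
    {"mon": int(r["mon"]), "carrier": r["carrier"], "delay": parse_delay(r["delay"])}
    for r in csv.DictReader(sample)
]

# Count the records whose delay is missing.
missing = sum(1 for r in rows if r["delay"] is None)
print(rows)
print("rows with missing delay:", missing)
```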


### Loading SMS spam data
You've seen that it's possible to infer data types directly from the data. Sometimes it's convenient to have direct control over the column types. You do this by defining an explicit schema.

The file `sms.csv` contains a selection of SMS messages which have been classified as either 'spam' or 'ham'. These data have been adapted from the UCI Machine Learning Repository. There are a total of 5574 SMS, of which 747 have been labelled as spam.

Notes on CSV format:

* no header record and
* fields are separated by a semicolon (this is not the default separator).

Data dictionary:

* `id` — record identifier
* `text` — content of SMS message
* `label` — spam or ham (integer; 0 = ham and 1 = spam)

In [6]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

# Load data from a delimited file
sms = spark.read.csv("sms.csv", sep=";", header=False, schema=schema)

# Print schema of DataFrame
sms.printSchema()

root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = true)



In [7]:
sms.show(5)

+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  1|Sorry, I'll call ...|    0|
|  2|Dont worry. I gue...|    0|
|  3|Call FREEPHONE 08...|    1|
|  4|Win a 1000 cash p...|    1|
|  5|Go until jurong p...|    0|
+---+--------------------+-----+
only showing top 5 rows
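
To make the idea of an explicit schema concrete, here is a plain-Python sketch that reads semicolon-separated records without a header and converts each field with a declared type. The sample messages are made up and the list-of-pairs `schema` is only an analogy for `StructType`, not how Spark represents it.

```python
import csv
import io

# Hypothetical sample: no header row, semicolon-separated (id;text;label).
sample = io.StringIO(
    "1;Sorry, I'll call later;0\n"
    "2;Win a cash prize now;1\n"
)

# An explicit "schema": column names paired with converter functions.
schema = [("id", int), ("text", str), ("label", int)]

records = []
for fields in csv.reader(sample, delimiter=";"):
    # Apply each column's converter to the matching field.
    records.append(
        {name: convert(value) for (name, convert), value in zip(schema, fields)}
    )

print(records)
```

Because the separator is a semicolon, the comma inside the first message stays part of the text field.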



# Classification


## Data Preparation


### Removing columns and rows
You previously loaded airline flight data from a CSV file. You're going to develop a model which will predict whether or not a given flight will be delayed.

In this exercise you need to trim those data down by:

1. removing an uninformative column and
2. removing rows which do not have information about whether or not a flight was delayed.

The data are available as `flights`.

In [14]:
# Remove the 'flight' column
flights_drop_column = flights.drop('flight')

# Number of records with missing 'delay' values
flights_drop_column.filter('delay IS NULL').count()

2978

In [15]:
# Remove records with missing 'delay' values
flights_valid_delay = flights_drop_column.filter('delay IS NOT NULL')

# Remove records with missing values in any column and get the number of remaining rows
flights_none_missing = flights_valid_delay.dropna()
print(flights_none_missing.count())

47022


### Column manipulation
The Federal Aviation Administration (FAA) considers a flight to be "delayed" when it arrives 15 minutes or more after its scheduled time.

The next step of preparing the flight data has two parts:

1. convert the units of distance, replacing the `mile` column with a `km` column; and
2. create a Boolean column indicating whether or not a flight was delayed.

In [17]:
# Import the required function
from pyspark.sql.functions import round

# Convert 'mile' to 'km' and drop 'mile' column
flights_km = flights.withColumn('km', round(flights.mile * 1.60934, 0)) \
                    .drop('mile')

In [18]:
flights_km.show()

+---+---+---+-------+------+---+------+--------+-----+------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|
+---+---+---+-------+------+---+------+--------+-----+------+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|1989.0|
|  4|  2|  5|     AA|   325|ORD|  8.92|      65| null| 415.0|
|  5|  2|  1|     UA|   704|SFO|  7.98|     102|    2| 885.0|
|  7|  2|  6|     AA|   380|ORD| 10.83|     135|   54|1180.0|
|  1| 16|  6|     UA|  1477|ORD|   8.0|     232|   -7|2317.0|
|  1| 22|  5|     UA|   620|SJC|  7.98|     250|  -13|2943.0|
| 11|  8|  1|     OO|  5590|SFO|  7.77|      60|   88| 254.0|
|  4| 26|  1|     AA|  1144|SFO| 13.25|     210|  -10|2356.0|
|  4| 25|  0|     AA|   321|ORD| 13.75|     160|   31|1574.0|
|  8| 30|  2|     UA|   646|ORD| 13.28|     151|   16|1157.0|
|  3| 16

In [21]:
flights_km.filter('delay > 15').show(10)

+---+---+---+-------+------+---+------+--------+-----+------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|
+---+---+---+-------+------+---+------+--------+-----+------+
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|
|  7|  2|  6|     AA|   380|ORD| 10.83|     135|   54|1180.0|
| 11|  8|  1|     OO|  5590|SFO|  7.77|      60|   88| 254.0|
|  4| 25|  0|     AA|   321|ORD| 13.75|     160|   31|1574.0|
|  8| 30|  2|     UA|   646|ORD| 13.28|     151|   16|1157.0|
|  0|  3|  4|     AA|  1559|LGA| 17.08|     190|   32|1765.0|
|  5|  9|  1|     UA|   770|SFO|  12.7|     158|   20|1556.0|
|  3| 10|  4|     B6|   937|ORD| 17.58|     265|  155|2792.0|
| 11| 15|  1|     AA|  2303|ORD|  6.75|     160|   23|1291.0|
|  8| 18|  4|     UA|   802|SJC|  6.33|     160|   17|1526.0|
+---+---+---+-------+------+---+------+--------+-----+------+
only showing top 10 rows



In [22]:
# Create 'label' column indicating whether flight delayed (1) or not (0)
flights_km = flights_km.withColumn('label', (flights_km.delay >= 15).cast('integer'))

# Check first five records
flights_km.show(5)

+---+---+---+-------+------+---+------+--------+-----+------+-----+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|label|
+---+---+---+-------+------+---+------+--------+-----+------+-----+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0| null|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|    1|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|    0|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|1989.0|    0|
|  4|  2|  5|     AA|   325|ORD|  8.92|      65| null| 415.0| null|
+---+---+---+-------+------+---+------+--------+-----+------+-----+
only showing top 5 rows
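
The two transformations can be checked by hand. The sketch below is plain Python rather than Spark: it applies the same conversion factor and the 15-minute threshold to the `mile` and `delay` values from the first three rows of the flights data, and shows why a missing delay produces a missing label. The function names are assumptions made for illustration.

```python
MILES_TO_KM = 1.60934
DELAY_THRESHOLD = 15  # minutes, from the FAA definition quoted above

def to_km(mile):
    """Convert miles to kilometres, rounded to zero decimal places."""
    return float(round(mile * MILES_TO_KM))

def to_label(delay):
    """1 if delayed by 15 minutes or more, 0 otherwise, None if unknown."""
    if delay is None:
        return None  # a missing delay cannot be classified
    return int(delay >= DELAY_THRESHOLD)

# (mile, delay) pairs taken from the first three rows of the flights data.
flights_sample = [(2153, None), (316, 30), (337, -8)]
converted = [(to_km(m), to_label(d)) for m, d in flights_sample]
print(converted)
```

The resulting values agree with the `km` and `label` columns in the table above, including the `null` label for the first flight.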



### Categorical columns
In the flights data there are two columns, `carrier` and `org`, which hold categorical data. You need to transform those columns into indexed numerical values.

In [27]:
from pyspark.ml.feature import StringIndexer

# Create an indexer
indexer = StringIndexer(inputCol='carrier', outputCol='carrier_idx')

# Indexer identifies categories in the data
indexer_model = indexer.fit(flights_km)

# Indexer creates a new column with numeric index values
flights_km = indexer_model.transform(flights_km)

# Repeat the process for the other categorical feature
flights_km = StringIndexer(inputCol='org', outputCol='org_idx').fit(flights_km).transform(flights_km)

In [29]:
flights_km.show(8)

+---+---+---+-------+------+---+------+--------+-----+------+-----+-----------+-------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|label|carrier_idx|org_idx|
+---+---+---+-------+------+---+------+--------+-----+------+-----+-----------+-------+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0| null|        6.0|    2.0|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|
|  4|  2|  5|     AA|   325|ORD|  8.92|      65| null| 415.0| null|        1.0|    0.0|
|  5|  2|  1|     UA|   704|SFO|  7.98|     102|    2| 885.0|    0|        0.0|    1.0|
|  7|  2|  6|     AA|   380|ORD| 10.83|     135|   54|1180.0|    1|        1.0|    0.0|
|  1| 16|  6|     UA|  1477|ORD|   8.0|     232|   -7|2317.0|    0|        0.0|    0.0|
+---+---+---+-------+------+---+------+--------+-----+------+-----+-----------+-------+
only showing top 8 rows
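
By default `StringIndexer` assigns index 0.0 to the most frequent category, 1.0 to the next most frequent, and so on. The plain-Python sketch below imitates that ordering on a made-up list of carriers; the counts are chosen so that there are no ties, and the alphabetical tie-break in the sort key is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical carrier column; counts are UA=3, AA=2, US=1 (no ties).
carriers = ["UA", "AA", "UA", "US", "AA", "UA"]

counts = Counter(carriers)
# Most frequent label first; ties (none here) would be broken alphabetically.
ordered = sorted(counts, key=lambda label: (-counts[label], label))
index_of = {label: float(i) for i, label in enumerate(ordered)}

# Replace each category with its numeric index.
carrier_idx = [index_of[c] for c in carriers]
print(index_of)
print(carrier_idx)
```

Note that the indices carry no numerical meaning: they only identify categories.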

In [30]:
# Import the necessary class
from pyspark.ml.feature import VectorAssembler

# Create an assembler object
assembler = VectorAssembler(inputCols=[
    'mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration'
], outputCol='features')

# Consolidate predictor columns
flights_assembled = assembler.transform(flights_km)

# Check the resulting column
flights_assembled.select('features', 'delay').show(5, truncate=False)

+-----------------------------------------+-----+
|features                                 |delay|
+-----------------------------------------+-----+
|[11.0,20.0,6.0,6.0,2.0,3465.0,9.48,351.0]|null |
|[0.0,22.0,2.0,0.0,0.0,509.0,16.33,82.0]  |30   |
|[2.0,20.0,4.0,0.0,1.0,542.0,6.17,82.0]   |-8   |
|[9.0,13.0,1.0,1.0,0.0,1989.0,10.33,195.0]|-5   |
|[4.0,2.0,5.0,1.0,0.0,415.0,8.92,65.0]    |null |
+-----------------------------------------+-----+
only showing top 5 rows



## Decision Tree

### Train/test split
To objectively assess a Machine Learning model you need to be able to test it on an independent set of data. You can't use the same data that you used to train the model: of course the model will perform (relatively) well on those data!

You will split the data into two components:

* training data (used to train the model) and
* testing data (used to test the model).

In [31]:
flights_assembled = flights_assembled.drop('flight')

In [32]:
# Split into training and testing sets in a 80:20 ratio
flights_train, flights_test = flights_assembled.randomSplit([0.8, 0.2], seed=17)

# Check that training set has around 80% of records
training_ratio = flights_train.count() / flights_assembled.count()
print(training_ratio)

0.79824
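
The training ratio is close to 0.8 but not exactly 0.8 because rows are assigned to the two sets at random. The sketch below is a plain-Python analogue, not Spark's implementation: each row is placed in the training set with probability 0.8, so the realised proportion varies slightly around the target. The function `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row independently to train or test using a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(50000))  # stand-in for 50 000 flight records
train, test = random_split(rows, 0.8, seed=17)

ratio = len(train) / len(rows)
print(ratio)
```

Fixing the seed makes the split reproducible, which is the purpose of `seed=17` in the cell above.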


### Build a Decision Tree
Now that you've split the flights data into training and testing sets, you can use the training set to fit a Decision Tree model.

The data are available as `flights_train` and `flights_test`.

In [None]:
# Import the Decision Tree Classifier class
from pyspark.ml.classification import DecisionTreeClassifier

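# Create a classifier object and fit to the training data
# (rows with a null label are dropped first, since the classifier cannot train on them)
tree = DecisionTreeClassifier()
tree_model = tree.fit(flights_train.dropna(subset=['label']))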

# Create predictions for the testing data and take a look at the predictions
prediction = tree_model.transform(flights_test)
prediction.select('label', 'prediction', 'probability').show(5, False)

### Evaluate the Decision Tree
You can assess the quality of your model by evaluating how well it performs on the testing data. Because the model was not trained on these data, this represents an objective assessment of the model.

A confusion matrix gives a useful breakdown of predictions versus known values. It has four cells which represent the counts of:

* True Negatives (TN): the model predicts a negative outcome and the known outcome is negative
* True Positives (TP): the model predicts a positive outcome and the known outcome is positive
* False Negatives (FN): the model predicts a negative outcome but the known outcome is positive
* False Positives (FP): the model predicts a positive outcome but the known outcome is negative.

In [None]:
# Create a confusion matrix
prediction.groupBy('label', 'prediction').count().show()

# Calculate the elements of the confusion matrix
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.filter('prediction = 1 AND label = prediction').count()
FN = prediction.filter('prediction = 0 AND label != prediction').count()
FP = prediction.filter('prediction = 1 AND label != prediction').count()

# Accuracy measures the proportion of correct predictions
accuracy = (TN+TP)/(TN+TP+FN+FP)
print(accuracy)
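
The same four counts can be worked out by hand on a tiny example. The labels and predictions below are made up for illustration (1 = delayed, 0 = on time), and the sketch uses plain Python rather than a Spark DataFrame.

```python
from collections import Counter

# Hypothetical known labels and model predictions.
labels      = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (label, prediction) combination, like groupBy(...).count().
cells = Counter(zip(labels, predictions))
TN = cells[(0, 0)]
TP = cells[(1, 1)]
FN = cells[(1, 0)]
FP = cells[(0, 1)]

# Accuracy is the proportion of correct predictions.
accuracy = (TN + TP) / (TN + TP + FN + FP)
print(TN, TP, FN, FP, accuracy)
```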

## Logistic Regression


### Build a Logistic Regression model
You've already built a Decision Tree model using the flights data. Now you're going to create a Logistic Regression model on the same data.

The objective is to predict whether a flight is likely to be delayed by at least 15 minutes (label 1) or not (label 0).

Although you have a variety of predictors at your disposal, you'll only use the `mon`, `depart` and `duration` columns for the moment. These are numerical features which can immediately be used for a Logistic Regression model. You'll need to do a little more work before you can include categorical features. Stay tuned!

In [44]:
# Import the logistic regression class
from pyspark.ml.classification import LogisticRegression

In [None]:
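# Create a classifier object and train on training data
# (rows with a null label are dropped first, since the classifier cannot train on them)
logistic = LogisticRegression().fit(flights_train.dropna(subset=['label']))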

# Create predictions for the testing data and show confusion matrix
prediction = logistic.transform(flights_test)
prediction.groupBy('label', 'prediction').count().show()

In [None]:
# Calculate the elements of the confusion matrix
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.filter('prediction = 1 AND label = prediction').count()
FN = prediction.filter('prediction = 0 AND label != prediction').count()
FP = prediction.filter('prediction = 1 AND label != prediction').count()

### Evaluate the Logistic Regression model
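Accuracy on its own can be misleading because it is dominated by the most common target class. For example, if 90% of flights were on time, a model that always predicted "on time" would score 90% accuracy without ever identifying a delayed flight.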

There are two other useful metrics:

* precision and
* recall.

Precision is the proportion of positive predictions which are correct. For all flights which are predicted to be delayed, what proportion is actually delayed?

Recall is the proportion of positive outcomes which are correctly predicted. For all delayed flights, what proportion is correctly predicted by the model?

The precision and recall are generally formulated in terms of the positive target class. But it's also possible to calculate weighted versions of these metrics which look at both target classes.

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

# Calculate precision and recall
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print('precision = {:.2f}\nrecall    = {:.2f}'.format(precision, recall))

# Find weighted precision
multi_evaluator = MulticlassClassificationEvaluator()
weighted_precision = multi_evaluator.evaluate(prediction, {multi_evaluator.metricName: "weightedPrecision"})

# Find AUC
binary_evaluator = BinaryClassificationEvaluator()
auc = binary_evaluator.evaluate(prediction, {binary_evaluator.metricName: "areaUnderROC"})
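
Weighted precision combines the precision of every class rather than only the positive one. The plain-Python sketch below assumes that each class's precision is weighted by that class's share of the known labels. The data are made up and deliberately imbalanced, and both function names are hypothetical.

```python
def precision_for(cls, labels, predictions):
    """Precision for one class: correct predictions of cls / all predictions of cls."""
    predicted_as_cls = [l for l, p in zip(labels, predictions) if p == cls]
    if not predicted_as_cls:
        return 0.0  # the class was never predicted
    return sum(1 for l in predicted_as_cls if l == cls) / len(predicted_as_cls)

def weighted_precision(labels, predictions):
    """Per-class precision, weighted by each class's share of the known labels."""
    total = len(labels)
    return sum(
        precision_for(cls, labels, predictions) * labels.count(cls) / total
        for cls in sorted(set(labels))
    )

# Hypothetical imbalanced data: eight negatives, two positives.
labels      = [0] * 8 + [1] * 2
predictions = [0] * 7 + [1] + [1, 0]

positive_precision = precision_for(1, labels, predictions)
weighted = weighted_precision(labels, predictions)
print(positive_precision, weighted)
```

In this example the positive-class precision is low while the weighted figure is pulled up by the well-predicted majority class, so the two numbers answer different questions.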

## Turning Text into Tables


### Punctuation, numbers and tokens
At the end of the previous chapter you loaded a dataset of SMS messages which had been labeled as either "spam" (label 1) or "ham" (label 0). You're now going to use those data to build a classifier model.

But first you'll need to prepare the SMS messages as follows:

* remove punctuation and numbers
* tokenize (split into individual words)
* remove stop words
* apply the hashing trick
* convert to TF-IDF representation.

In this exercise you'll remove punctuation and numbers, then tokenize the messages.

The SMS data are available as `sms`.

In [35]:
# Import the necessary functions
from pyspark.sql.functions import regexp_replace
from pyspark.ml.feature import Tokenizer

# Remove punctuation (REGEX provided) and numbers
wrangled = sms.withColumn('text', regexp_replace(sms.text, '[_():;,.!?\\-]', ' '))
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, '[0-9]', ' '))

# Merge multiple spaces
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))

# Split the text into words
wrangled = Tokenizer(inputCol='text', outputCol='words').transform(wrangled)

wrangled.show(4, truncate=False)

+---+----------------------------------+-----+------------------------------------------+
|id |text                              |label|words                                     |
+---+----------------------------------+-----+------------------------------------------+
|1  |Sorry I'll call later in meeting  |0    |[sorry, i'll, call, later, in, meeting]   |
|2  |Dont worry I guess he's busy      |0    |[dont, worry, i, guess, he's, busy]       |
|3  |Call FREEPHONE now                |1    |[call, freephone, now]                    |
|4  |Win a cash prize or a prize worth |1    |[win, a, cash, prize, or, a, prize, worth]|
+---+----------------------------------+-----+------------------------------------------+
only showing top 4 rows
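
The cleaning steps can be tried out without Spark. The sketch below applies the same three regular expressions with Python's `re` module and then lower-cases and splits the text, which approximates what `Tokenizer` does. The input message is a made-up example and `clean_and_tokenize` is a hypothetical helper.

```python
import re

def clean_and_tokenize(text):
    """Strip punctuation and digits, merge spaces, then split into lower-case words."""
    text = re.sub(r"[_():;,.!?\-]", " ", text)  # punctuation -> space
    text = re.sub(r"[0-9]", " ", text)          # digits -> space
    text = re.sub(r" +", " ", text)             # merge repeated spaces
    return text, text.lower().split()

# Hypothetical message containing punctuation and a number.
cleaned, words = clean_and_tokenize("Call FREEPHONE 0800 now!")
print(repr(cleaned))
print(words)
```

Replacing characters with spaces, rather than deleting them, keeps neighbouring words from being glued together.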



### Stop words and hashing
The next steps will be to remove stop words and then apply the hashing trick, converting the results into a TF-IDF representation.

A quick reminder about these concepts:

* The hashing trick provides a fast and space-efficient way to map a very large (possibly infinite) set of items (in this case, all words contained in the SMS messages) onto a smaller, finite number of values.
* The TF-IDF matrix reflects how important a word is to each document. It takes into account both the frequency of the word within each document and the frequency of the word across all of the documents in the collection.

The tokenized SMS data are stored in `wrangled` in a column named `words`. You've cleaned up the handling of spaces in the data so that the tokenized text is neater.

In [36]:
from pyspark.ml.feature import StopWordsRemover, HashingTF, IDF

# Remove stop words.
wrangled = StopWordsRemover(inputCol='words', outputCol='terms')\
      .transform(wrangled)

In [38]:
wrangled.show(5)

+---+--------------------+-----+--------------------+--------------------+
| id|                text|label|               words|               terms|
+---+--------------------+-----+--------------------+--------------------+
|  1|Sorry I'll call l...|    0|[sorry, i'll, cal...|[sorry, call, lat...|
|  2|Dont worry I gues...|    0|[dont, worry, i, ...|[dont, worry, gue...|
|  3| Call FREEPHONE now |    1|[call, freephone,...|   [call, freephone]|
|  4|Win a cash prize ...|    1|[win, a, cash, pr...|[win, cash, prize...|
|  5|Go until jurong p...|    0|[go, until, juron...|[go, jurong, poin...|
+---+--------------------+-----+--------------------+--------------------+
only showing top 5 rows



In [39]:
# Apply the hashing trick
wrangled = HashingTF(inputCol="terms", outputCol="hash", numFeatures=1024)\
      .transform(wrangled)

In [40]:
wrangled.show(5)

+---+--------------------+-----+--------------------+--------------------+--------------------+
| id|                text|label|               words|               terms|                hash|
+---+--------------------+-----+--------------------+--------------------+--------------------+
|  1|Sorry I'll call l...|    0|[sorry, i'll, cal...|[sorry, call, lat...|(1024,[138,344,37...|
|  2|Dont worry I gues...|    0|[dont, worry, i, ...|[dont, worry, gue...|(1024,[53,233,329...|
|  3| Call FREEPHONE now |    1|[call, freephone,...|   [call, freephone]|(1024,[138,396],[...|
|  4|Win a cash prize ...|    1|[win, a, cash, pr...|[win, cash, prize...|(1024,[31,69,387,...|
|  5|Go until jurong p...|    0|[go, until, juron...|[go, jurong, poin...|(1024,[116,262,33...|
+---+--------------------+-----+--------------------+--------------------+--------------------+
only showing top 5 rows
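
The hashing trick itself is simple to sketch. The plain-Python version below maps each term to one of 1024 buckets and counts how many terms land in each bucket. It uses MD5 purely as a convenient deterministic stand-in, so the bucket indices will not match those produced by `HashingTF`; the function names are hypothetical.

```python
import hashlib

NUM_FEATURES = 1024  # same vector size as numFeatures above

def bucket(term, num_features=NUM_FEATURES):
    """Map a term to a bucket index in [0, num_features)."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

def hashing_tf(terms, num_features=NUM_FEATURES):
    """Return a sparse {bucket index: term count} mapping."""
    vector = {}
    for term in terms:
        idx = bucket(term, num_features)
        vector[idx] = vector.get(idx, 0) + 1
    return vector

terms = ["win", "cash", "prize", "prize", "worth"]
sparse = hashing_tf(terms)
print(sparse)
```

No vocabulary needs to be stored, at the cost that two different terms can collide in the same bucket.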



In [42]:
# Convert hashed symbols to TF-IDF
tf_idf = IDF(inputCol="hash", outputCol="features")\
      .fit(wrangled).transform(wrangled)
      
tf_idf.select('terms', 'features').show(4, truncate=False)

+--------------------------------+----------------------------------------------------------------------------------------------------+
|terms                           |features                                                                                            |
+--------------------------------+----------------------------------------------------------------------------------------------------+
|[sorry, call, later, meeting]   |(1024,[138,344,378,1006],[2.2391682769656747,2.892706319430574,3.684405173719015,4.244020961654438])|
|[dont, worry, guess, busy]      |(1024,[53,233,329,858],[4.618714411095849,3.557143394108088,4.618714411095849,4.937168142214383])   |
|[call, freephone]               |(1024,[138,396],[2.2391682769656747,3.3843005812686773])                                            |
|[win, cash, prize, prize, worth]|(1024,[31,69,387,428],[3.7897656893768414,7.284881949239966,4.4671645129686475,3.898659777615979])  |
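+--------------------------------+----------------------------------------------------------------------------------------------------+
only showing top 4 rows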
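
To see where such weights come from, the sketch below computes TF-IDF in plain Python for three made-up documents. It assumes the smoothed inverse document frequency given in the Spark documentation, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. For readability it works on the terms directly rather than on hashed bucket indices.

```python
import math

# Hypothetical tokenized documents (terms after stop-word removal).
docs = [
    ["call", "later", "meeting"],
    ["call", "freephone"],
    ["win", "cash", "prize", "prize"],
]

def idf(term, docs):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    m = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log((m + 1) / (df + 1))

def tf_idf(doc, docs):
    """Term frequency within the document multiplied by the term's IDF."""
    return {term: doc.count(term) * idf(term, docs) for term in set(doc)}

weights = tf_idf(docs[2], docs)
print(weights)
```

A term that occurs in many documents, such as "call" here, gets a lower IDF than a term that occurs in only one, and a term repeated within a message has its weight scaled by the repeat count.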

### Training a spam classifier
The SMS data have now been prepared for building a classifier. Specifically, this is what you have done:

* removed numbers and punctuation
* split the messages into words (or "tokens")
* removed stop words
* applied the hashing trick and
* converted to a TF-IDF representation.

Next you'll need to split the TF-IDF data into training and testing sets. Then you'll use the training data to fit a Logistic Regression model and finally evaluate the performance of that model on the testing data.


In [45]:
# Split the data into training and testing sets
sms_train, sms_test = tf_idf.randomSplit([0.8, 0.2], seed=13)

# Fit a Logistic Regression model to the training data
logistic = LogisticRegression(regParam=0.2).fit(sms_train)

# Make predictions on the testing data
prediction = logistic.transform(sms_test)

# Create a confusion matrix, comparing predictions to known labels
prediction.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       1.0|  124|
|    0|       0.0|  987|
|    0|       1.0|    3|
|    1|       0.0|   47|
+-----+----------+-----+
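
The counts in this confusion matrix are enough to compute accuracy, precision and recall by hand. The short plain-Python calculation below takes spam (label 1) as the positive class.

```python
# Counts read from the confusion matrix above (label 1 = spam).
TP = 124  # label 1, prediction 1
TN = 987  # label 0, prediction 0
FP = 3    # label 0, prediction 1
FN = 47   # label 1, prediction 0

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print("accuracy  = {:.3f}".format(accuracy))
print("precision = {:.3f}".format(precision))
print("recall    = {:.3f}".format(recall))
```

Precision comes out at roughly 0.98 and recall at roughly 0.73: the model rarely flags ham as spam, but it misses a fair share of the spam messages.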



# Regression