# Machine Learning & Spark


## Characteristics of Spark
Spark is one of the most widely used technologies for processing large quantities of data. It handles enormous data volumes efficiently and, unlike some other distributed computing technologies, it is a pleasure to develop with.

Which of these describe Spark?

1. Spark is a framework for cluster computing.

2. Spark does most processing in memory.

3. Spark has a high-level API, which conceals a lot of complexity.

4. <b> All of the above. </b>


## Components in a Spark Cluster
Spark is a distributed computing platform. It achieves efficiency by distributing data and computation across a cluster of computers.

A Spark cluster consists of a number of hardware and software components which work together.

Which of these is not part of a Spark cluster?

1. One or more nodes

2. A cluster manager

3. <b> A load balancer </b>

4. Executors


### Creating a SparkSession
In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a `SparkSession` object.

The `SparkSession` class has a `builder` attribute, which is an instance of the `Builder` class. The `Builder` class exposes three important methods that let you:

* specify the location of the master node;
* name the application (optional); and
* retrieve an existing `SparkSession` or, if there is none, create a new one.

The `SparkSession` class has a `version` attribute which gives the version of Spark.

* Import the `SparkSession` class from `pyspark.sql`.
* Create a `SparkSession` object connected to a local cluster. Use all available cores. Name the application `'test'`.
* Use the `SparkSession` object to retrieve the version of Spark running on the cluster. Note: The version might be different to the one that's used in the presentation (it gets updated from time to time).
* Shut down the cluster.

In [None]:
# Import the SparkSession class
from pyspark.sql import SparkSession

# Create SparkSession object
spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('test') \
                    .getOrCreate()

# What version of Spark?
# (Might be different to what you saw in the presentation!)
print(spark.version)

# Terminate the cluster
spark.stop()

In [1]:
# Start a fresh session for the remaining exercises, since the previous one was stopped
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

## Loading Data

### Loading flights data
In this exercise you're going to load some airline flight data from a CSV file. To ensure that the exercise runs quickly these data have been trimmed down to only 50 000 records. You can get a larger dataset in the same format here.

Notes on CSV format:

* fields are separated by a comma (this is the default separator) and
* missing data are denoted by the string 'NA'.

Data dictionary:

* `mon` — month (integer between 1 and 12)
* `dom` — day of month (integer between 1 and 31)
* `dow` — day of week (integer; 1 = Monday and 7 = Sunday)
* `org` — origin airport (IATA code)
* `mile` — distance (miles)
* `carrier` — carrier (IATA code)
* `depart` — departure time (decimal hour)
* `duration` — expected duration (minutes)
* `delay` — delay (minutes)

Note: The data have been aggressively down-sampled.
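
To make the role of the `'NA'` marker concrete, here is a minimal plain-Python sketch (no Spark required) that parses comma-separated records and maps `'NA'` to `None`. This is roughly what `nullValue='NA'` asks Spark's CSV reader to do. The helper name `parse_flights` is hypothetical, and the two sample rows are the first two records of the flights data as printed in this section.

```python
import csv
import io

# Header plus the first two records of the flights data
SAMPLE = """mon,dom,dow,carrier,flight,org,mile,depart,duration,delay
11,20,6,US,19,JFK,2153,9.48,351,NA
0,22,2,UA,1107,ORD,316,16.33,82,30
"""

def parse_flights(text, null_value="NA"):
    """Read comma-separated records, turning the null marker into None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=",")
    rows = []
    for record in reader:
        rows.append({k: (None if v == null_value else v) for k, v in record.items()})
    return rows

rows = parse_flights(SAMPLE)
print(rows[0]["delay"])  # None
print(rows[1]["delay"])  # 30
```

Unlike this sketch, which leaves every other field as a string, Spark also infers numeric column types when `inferSchema=True` is set.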

In [13]:
# Read data from CSV file
flights = spark.read.csv('flights.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,
                         nullValue='NA')

# Get number of records
print("The data contain %d records." % flights.count())

# View the first five records
flights.show(5)

# Check column data types
print(flights.dtypes)

The data contain 50000 records.
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
|  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows

[('mon', 'int'), ('dom', 'int'), ('dow', 'int'), ('carrier', 'string'), ('flight', 'int'), ('org', 'string'), ('mile', 'int'), ('depart', 'double'), ('duration', 'int'), ('delay', 'int')]


### Loading SMS spam data
You've seen that it's possible to infer data types directly from the data. Sometimes it's convenient to have direct control over the column types. You do this by defining an explicit schema.

The file `sms.csv` contains a selection of SMS messages which have been classified as either 'spam' or 'ham'. These data have been adapted from the UCI Machine Learning Repository. There are a total of 5574 SMS, of which 747 have been labelled as spam.

Notes on CSV format:

* no header record and
* fields are separated by a semicolon (this is not the default separator).

Data dictionary:

* `id` — record identifier
* `text` — content of SMS message
* `label` — spam or ham (integer; 0 = ham and 1 = spam)
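
As a rough illustration of what an explicit schema buys you, the following plain-Python sketch (no Spark required) reads headerless, semicolon-separated records and casts each field to a declared type. The sample records, the `SCHEMA` list and the `read_with_schema` helper are all made up for this example and are not part of `sms.csv` or of PySpark.

```python
import csv
import io

# Hypothetical records in the sms.csv layout: no header, ';' as separator
SAMPLE = "1;Sorry, I'll call later;0\n2;Win a cash prize now;1\n"

# Column names paired with the Python type each field is cast to
SCHEMA = [("id", int), ("text", str), ("label", int)]

def read_with_schema(text, schema, sep=";"):
    """Parse delimited text and apply the declared name and type to each field."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    return [
        {name: cast(value) for (name, cast), value in zip(schema, fields)}
        for fields in reader
    ]

sms_rows = read_with_schema(SAMPLE, SCHEMA)
print(sms_rows[0])
```

Note that the first message contains a comma, so reading these records with the default comma separator would split the text across two fields.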

In [6]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

# Load data from a delimited file
sms = spark.read.csv("sms.csv", sep=";", header=False, schema=schema)

# Print schema of DataFrame
sms.printSchema()

root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = true)



In [7]:
sms.show(5)

+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  1|Sorry, I'll call ...|    0|
|  2|Dont worry. I gue...|    0|
|  3|Call FREEPHONE 08...|    1|
|  4|Win a 1000 cash p...|    1|
|  5|Go until jurong p...|    0|
+---+--------------------+-----+
only showing top 5 rows



# Classification


## Data Preparation


### Removing columns and rows
You previously loaded airline flight data from a CSV file. You're going to develop a model which will predict whether or not a given flight will be delayed.

In this exercise you need to trim those data down by:

1. removing an uninformative column and
2. removing rows which do not have information about whether or not a flight was delayed.

The data are available as `flights`.

In [14]:
# Remove the 'flight' column
flights_drop_column = flights.drop('flight')

# Number of records with missing 'delay' values
flights_drop_column.filter('delay IS NULL').count()

2978

In [15]:
# Remove records with missing 'delay' values
flights_valid_delay = flights_drop_column.filter('delay IS NOT NULL')

# Remove records with missing values in any column and get the number of remaining rows
flights_none_missing = flights_valid_delay.dropna()
print(flights_none_missing.count())

47022
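
The counts above are consistent with each other: 50000 - 2978 = 47022, so in this dataset `dropna()` removed no further rows once the records with a missing `delay` were gone. The difference between the two operations can be sketched in plain Python as follows. The three rows are hypothetical, and `None` stands in for a missing value.

```python
# Hypothetical rows; None stands in for a missing value
rows = [
    {"carrier": "US", "mile": 2153, "delay": None},
    {"carrier": "UA", "mile": 316, "delay": 30},
    {"carrier": None, "mile": 337, "delay": -8},
]

# Similar in spirit to filter('delay IS NOT NULL'): tests one column only
valid_delay = [r for r in rows if r["delay"] is not None]

# Similar in spirit to dropna(): discards rows with a missing value in any column
none_missing = [r for r in valid_delay if all(v is not None for v in r.values())]

print(len(rows), len(valid_delay), len(none_missing))  # 3 2 1
```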


### Column manipulation
The Federal Aviation Administration (FAA) considers a flight to be "delayed" when it arrives 15 minutes or more after its scheduled time.

The next step of preparing the flight data has two parts:

1. convert the units of distance, replacing the `mile` column with a `km` column; and
2. create a Boolean column indicating whether or not a flight was delayed.
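
The first step is plain arithmetic, which can be checked without Spark. The sketch below uses the conversion factor 1.60934 (the same value used in the Spark cell in this section) on the distances of the first three flight records. The helper name `miles_to_km` is hypothetical.

```python
KM_PER_MILE = 1.60934  # conversion factor, as used in the Spark cell in this section

def miles_to_km(miles):
    """Convert miles to kilometres, rounded to the nearest whole number."""
    # Python's round() rounds exact halves to the nearest even number, whereas
    # Spark's round() rounds them up; the two agree for the values used here.
    return float(round(miles * KM_PER_MILE))

# Distances in miles taken from the first three flight records
for miles in (2153, 316, 337):
    print(miles, "->", miles_to_km(miles))
```

The results (3465.0, 509.0 and 542.0) match the `km` values that Spark reports for the same records.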

In [17]:
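# Import the required function (note: this shadows Python's built-in round)
from pyspark.sql.functions import round

# Convert 'mile' to 'km' and drop 'mile' column
# (this starts from the original `flights`, so the 'flight' column and
# records with missing delays are still present in the output)
flights_km = flights.withColumn('km', round(flights.mile * 1.60934, 0)) \
                    .drop('mile')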

In [18]:
flights_km.show()

+---+---+---+-------+------+---+------+--------+-----+------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|
+---+---+---+-------+------+---+------+--------+-----+------+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|1989.0|
|  4|  2|  5|     AA|   325|ORD|  8.92|      65| null| 415.0|
|  5|  2|  1|     UA|   704|SFO|  7.98|     102|    2| 885.0|
|  7|  2|  6|     AA|   380|ORD| 10.83|     135|   54|1180.0|
|  1| 16|  6|     UA|  1477|ORD|   8.0|     232|   -7|2317.0|
|  1| 22|  5|     UA|   620|SJC|  7.98|     250|  -13|2943.0|
| 11|  8|  1|     OO|  5590|SFO|  7.77|      60|   88| 254.0|
|  4| 26|  1|     AA|  1144|SFO| 13.25|     210|  -10|2356.0|
|  4| 25|  0|     AA|   321|ORD| 13.75|     160|   31|1574.0|
|  8| 30|  2|     UA|   646|ORD| 13.28|     151|   16|1157.0|
|  3| 16

In [19]:
# First attempt: fails because 'delay' is a plain Python string, not a Column,
# so it has no .cast() method
flights_km = flights_km.withColumn('label', ('delay').cast('integer'))

AttributeError: 'str' object has no attribute 'cast'

In [21]:
flights_km.filter('delay > 15').show(10)

+---+---+---+-------+------+---+------+--------+-----+------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|
+---+---+---+-------+------+---+------+--------+-----+------+
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|
|  7|  2|  6|     AA|   380|ORD| 10.83|     135|   54|1180.0|
| 11|  8|  1|     OO|  5590|SFO|  7.77|      60|   88| 254.0|
|  4| 25|  0|     AA|   321|ORD| 13.75|     160|   31|1574.0|
|  8| 30|  2|     UA|   646|ORD| 13.28|     151|   16|1157.0|
|  0|  3|  4|     AA|  1559|LGA| 17.08|     190|   32|1765.0|
|  5|  9|  1|     UA|   770|SFO|  12.7|     158|   20|1556.0|
|  3| 10|  4|     B6|   937|ORD| 17.58|     265|  155|2792.0|
| 11| 15|  1|     AA|  2303|ORD|  6.75|     160|   23|1291.0|
|  8| 18|  4|     UA|   802|SJC|  6.33|     160|   17|1526.0|
+---+---+---+-------+------+---+------+--------+-----+------+
only showing top 10 rows



In [22]:
# Create 'label' column indicating whether flight delayed (1) or not (0)
flights_km = flights_km.withColumn('label', (flights_km.delay >= 15).cast('integer'))

# Check first five records
flights_km.show(5)

+---+---+---+-------+------+---+------+--------+-----+------+-----+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|label|
+---+---+---+-------+------+---+------+--------+-----+------+-----+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0| null|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|    1|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|    0|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|1989.0|    0|
|  4|  2|  5|     AA|   325|ORD|  8.92|      65| null| 415.0| null|
+---+---+---+-------+------+---+------+--------+-----+------+-----+
only showing top 5 rows
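
The `null` labels in the output above follow from SQL semantics: comparing a missing value with 15 yields a missing value, and casting that to an integer keeps it missing. A minimal plain-Python sketch of the same rule is shown below. The function name `delay_label` is hypothetical, and `None` plays the role of `null`.

```python
def delay_label(delay, threshold=15):
    """Return 1 if delayed by `threshold` minutes or more, 0 if not, None if unknown."""
    if delay is None:
        return None  # mirrors SQL: a comparison with NULL yields NULL
    return int(delay >= threshold)

# Delay values from the five records shown above
delays = [None, 30, -8, -5, None]
print([delay_label(d) for d in delays])  # [None, 1, 0, 0, None]
```

A delay of exactly 15 minutes gets the label 1, in line with the "15 minutes or more" definition.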



### Categorical columns
In the flights data there are two columns, `carrier` and `org`, which hold categorical data. You need to transform those columns into indexed numerical values.

In [27]:
from pyspark.ml.feature import StringIndexer

# Create an indexer
indexer = StringIndexer(inputCol='carrier', outputCol='carrier_idx')

# Indexer identifies categories in the data
indexer_model = indexer.fit(flights_km)

# Indexer creates a new column with numeric index values
flights_km = indexer_model.transform(flights_km)

# Repeat the process for the other categorical feature
flights_km = StringIndexer(inputCol='org', outputCol='org_idx').fit(flights_km).transform(flights_km)

In [29]:
flights_km.show(8)

+---+---+---+-------+------+---+------+--------+-----+------+-----+-----------+-------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|label|carrier_idx|org_idx|
+---+---+---+-------+------+---+------+--------+-----+------+-----+-----------+-------+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0| null|        6.0|    2.0|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|
|  4|  2|  5|     AA|   325|ORD|  8.92|      65| null| 415.0| null|        1.0|    0.0|
|  5|  2|  1|     UA|   704|SFO|  7.98|     102|    2| 885.0|    0|        0.0|    1.0|
|  7|  2|  6|     AA|   380|ORD| 10.83|     135|   54|1180.0|    1|        1.0|    0.0|
|  1| 16|  6|     UA|  1477|ORD|   8.0|     232|   -7|2317.0|    0|        0.0|    0.0|
+---+---+---+-------+------+---+------+--------+-----+------+-----+-----------+-------+
only showing top 8 rows
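
By default `StringIndexer` assigns indices in order of descending frequency, so the most common category receives index 0.0. The plain-Python sketch below imitates that behaviour on a small hypothetical carrier column. The helper `fit_string_index` is not part of PySpark, and the alphabetical tie-breaking is an assumption that may not match every Spark version.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically (an assumption
    # about tie-breaking, which can differ between Spark versions)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical carrier column, not the real data
carriers = ["UA", "AA", "UA", "US", "AA", "UA"]
mapping = fit_string_index(carriers)
print(mapping)  # {'UA': 0.0, 'AA': 1.0, 'US': 2.0}
print([mapping[c] for c in carriers])
```

Building the mapping corresponds to `fit()`, and looking up each value corresponds to `transform()`.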

In [30]:
# Import the necessary class
from pyspark.ml.feature import VectorAssembler

# Create an assembler object
assembler = VectorAssembler(inputCols=[
    'mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration'
], outputCol='features')

# Consolidate predictor columns
flights_assembled = assembler.transform(flights_km)

# Check the resulting column
flights_assembled.select('features', 'delay').show(5, truncate=False)

+-----------------------------------------+-----+
|features                                 |delay|
+-----------------------------------------+-----+
|[11.0,20.0,6.0,6.0,2.0,3465.0,9.48,351.0]|null |
|[0.0,22.0,2.0,0.0,0.0,509.0,16.33,82.0]  |30   |
|[2.0,20.0,4.0,0.0,1.0,542.0,6.17,82.0]   |-8   |
|[9.0,13.0,1.0,1.0,0.0,1989.0,10.33,195.0]|-5   |
|[4.0,2.0,5.0,1.0,0.0,415.0,8.92,65.0]    |null |
+-----------------------------------------+-----+
only showing top 5 rows
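
Conceptually, the assembler concatenates the listed columns, in order, into a single numeric vector per record. The plain-Python sketch below reproduces the first `features` value shown above. The `assemble` helper is hypothetical, and a list of floats stands in for Spark's vector type.

```python
INPUT_COLS = ['mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration']

def assemble(row, input_cols):
    """Concatenate the named fields of one record into a list of floats."""
    return [float(row[name]) for name in input_cols]

# First record from the output above; None stands in for null
row = {'mon': 11, 'dom': 20, 'dow': 6, 'carrier_idx': 6.0, 'org_idx': 2.0,
       'km': 3465.0, 'depart': 9.48, 'duration': 351, 'delay': None}

print(assemble(row, INPUT_COLS))
# [11.0, 20.0, 6.0, 6.0, 2.0, 3465.0, 9.48, 351.0]
```

Columns that are not listed, such as `delay`, are left out of the vector and stay available as separate columns.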

