# Machine Learning with PySpark

Spark is a powerful, general-purpose tool for working with Big Data. It transparently handles the distribution of compute tasks across a cluster, so operations are fast and you can focus on the analysis rather than on technical details. In this course you'll learn how to get data into Spark and then delve into three fundamental Spark Machine Learning topics: Linear Regression, Logistic Regression and other classifiers, and creating pipelines. Along the way you'll analyse a large dataset of flight delays and a collection of spam text messages. With this background you'll be ready to harness the power of Spark and apply it to your own Machine Learning projects!

## Table of Contents

- [Introduction & Data Loading](#intro)
- [Classification](#class)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

path = "data/dc35/"

In [3]:
from pyspark import SparkContext
sc = SparkContext("local", "First App")
print(sc)

<SparkContext master=local appName=First App>


In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('First App').getOrCreate()

In [5]:
# Return spark version
print(spark.version)

# Return python version
import sys
print(sys.version_info)

2.4.4
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)


---
<a id='intro'></a>

## Machine Learning & Spark

<img src="images/spark5_001.png" alt="" style="width: 800px;"/>

<img src="images/spark5_002.png" alt="" style="width: 800px;"/>

<img src="images/spark5_003.png" alt="" style="width: 800px;"/>

<img src="images/spark5_004.png" alt="" style="width: 800px;"/>

<img src="images/spark5_005.png" alt="" style="width: 800px;"/>

## Connecting to Spark

<img src="images/spark5_006.png" alt="" style="width: 800px;"/>

<img src="images/spark5_007.png" alt="" style="width: 800px;"/>

<img src="images/spark5_008.png" alt="" style="width: 800px;"/>

<img src="images/spark5_009.png" alt="" style="width: 800px;"/>

<img src="images/spark5_010.png" alt="" style="width: 800px;"/>

## Creating a SparkSession

In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a `SparkSession` object.

The SparkSession class has a `builder` attribute, which is an instance of the Builder class. The Builder class exposes three important methods that let you:

- specify the location of the master node;
- name the application (optional); and
- retrieve an existing SparkSession or, if there is none, create a new one.

The SparkSession class has a `version` attribute which gives the version of Spark.

Find out more about SparkSession [here](https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession).

Once you are finished with the cluster, it's a good idea to shut it down, which will free up its resources, making them available for other processes.

- Import the SparkSession class from pyspark.sql.
- Create a SparkSession object connected to a local cluster. Use all available cores. Name the application 'test'.
- Use the SparkSession object to retrieve the version of Spark running on the cluster.
- Shut down the cluster.

In [2]:
# Import the PySpark module
from pyspark.sql import SparkSession

# Create SparkSession object
spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('test') \
                    .getOrCreate()

# What version of Spark?
print(spark.version)

# Terminate the cluster; recreate the session before running any later cells
spark.stop()

2.4.4


## Loading Data

<img src="images/spark5_011.png" alt="" style="width: 800px;"/>

<img src="images/spark5_012.png" alt="" style="width: 800px;"/>

<img src="images/spark5_013.png" alt="" style="width: 800px;"/>

<img src="images/spark5_014.png" alt="" style="width: 800px;"/>

<img src="images/spark5_015.png" alt="" style="width: 800px;"/>

<img src="images/spark5_016.png" alt="" style="width: 800px;"/>

<img src="images/spark5_017.png" alt="" style="width: 800px;"/>

<img src="images/spark5_018.png" alt="" style="width: 800px;"/>

<img src="images/spark5_019.png" alt="" style="width: 800px;"/>

## Loading flights data

In this exercise you're going to load some airline flight data from a CSV file. To ensure that the exercise runs quickly, these data have been trimmed down to only 50 000 records. You can get a larger dataset in the same format [here](). The code cell below reads the larger file, 'flights-larger.csv', which is why it reports 275000 records.

Notes on CSV format:

- fields are separated by a comma (this is the default separator) and
- missing data are denoted by the string 'NA'.
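
As a rough illustration of what the 'NA' convention means, the sketch below uses only Python's standard `csv` module, not Spark's reader. The helper name `read_with_null_marker` is made up for this example, and the two sample records are copied from the output shown further down. It mimics the effect of a null marker; it does not infer column types.

```python
import csv
import io

# Two records in the same layout as the flights file.
SAMPLE = """mon,dom,dow,carrier,flight,org,mile,depart,duration,delay
10,10,1,OO,5836,ORD,157,8.18,51,27
1,4,1,OO,5866,ORD,466,15.5,102,NA
"""

def read_with_null_marker(text, null_value="NA", sep=","):
    """Parse delimited text, replacing the null marker with None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    return [
        {name: (None if value == null_value else value)
         for name, value in row.items()}
        for row in reader
    ]

records = read_with_null_marker(SAMPLE)
print(records[1]["delay"])  # None
print(records[0]["delay"])  # 27 (still a string; no type inference here)
```

Without such a marker, the 'NA' entries would stay as text, which is the kind of thing that would stop the delay column from being read as an integer.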

Data dictionary:

- mon — month (integer between 1 and 12)
- dom — day of month (integer between 1 and 31)
- dow — day of week (integer; 1 = Monday and 7 = Sunday)
- org — origin airport (IATA code)
- mile — distance (miles)
- carrier — carrier (IATA code)
- depart — departure time (decimal hour)
- duration — expected duration (minutes)
- delay — delay (minutes)
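
The "decimal hour" unit for depart can be made concrete with a small plain-Python helper. This is a sketch that assumes the value means whole hours plus a fraction of an hour, so 15.5 is 15:30. The function name `decimal_hour_to_hhmm` is hypothetical.

```python
def decimal_hour_to_hhmm(decimal_hour):
    """Convert a decimal hour such as 15.5 to an 'HH:MM' string."""
    # Work in whole minutes so that rounding cannot produce ':60'.
    total_minutes = round(decimal_hour * 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"

for depart in (8.18, 15.5, 21.17):
    print(depart, "->", decimal_hour_to_hhmm(depart))
# 8.18 -> 08:11
# 15.5 -> 15:30
# 21.17 -> 21:10
```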

pyspark has been imported for you and the session has been initialized.

- Read data from a CSV file called 'flights.csv'. Assign data types to columns automatically. Deal with missing data.
- How many records are in the data?
- Take a look at the first five records.
- What data types have been assigned to the columns? Do these look correct?

In [7]:
# Read data from CSV file
flights = spark.read.csv(path+'flights-larger.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,
                         nullValue='NA')

# Get number of records
print("The data contain %d records." % flights.count())

# View the first five records
flights.show(5)

# Check column data types
flights.dtypes

The data contain 275000 records.
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 10| 10|  1|     OO|  5836|ORD| 157|  8.18|      51|   27|
|  1|  4|  1|     OO|  5866|ORD| 466|  15.5|     102| null|
| 11| 22|  1|     OO|  6016|ORD| 738|  7.17|     127|  -19|
|  2| 14|  5|     B6|   199|JFK|2248| 21.17|     365|   60|
|  5| 25|  3|     WN|  1675|SJC| 386| 12.92|      85|   22|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows



[('mon', 'int'),
 ('dom', 'int'),
 ('dow', 'int'),
 ('carrier', 'string'),
 ('flight', 'int'),
 ('org', 'string'),
 ('mile', 'int'),
 ('depart', 'double'),
 ('duration', 'int'),
 ('delay', 'int')]

## Loading SMS spam data

You've seen that it's possible to infer data types directly from the data. Sometimes it's convenient to have direct control over the column types. You do this by defining an explicit schema.

The file sms.csv contains a selection of SMS messages which have been classified as either 'spam' or 'ham'. These data have been adapted from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). There are a total of 5574 SMS messages, of which 747 have been labelled as spam.

Notes on CSV format:

- no header record and
- fields are separated by a semicolon (this is not the default separator).

Data dictionary:

- id — record identifier
- text — content of SMS message
- label — spam or ham (integer; 0 = ham and 1 = spam)
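
To show what an explicit schema buys you, here is a minimal standard-library sketch (not Spark) that reads headerless, semicolon-separated text and casts each field according to a list of (name, type) pairs. The sample rows are made up for illustration and are not taken from sms.csv. `read_with_schema` is a hypothetical helper.

```python
import csv
import io

# Made-up sample rows; the real file contents are not reproduced here.
SAMPLE = "1;Sorry, I'll call later;0\n2;WINNER! Claim your free prize now;1\n"

# (column name, converter) pairs playing the role of the explicit schema.
SCHEMA = [("id", int), ("text", str), ("label", int)]

def read_with_schema(text, schema, sep=";"):
    """Parse headerless delimited text, casting each field per the schema."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=sep):
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, fields)})
    return rows

sms_rows = read_with_schema(SAMPLE, SCHEMA)
print(sms_rows[1])
```

Because the separator is a semicolon, the comma inside the first message stays part of the text field.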

Instructions

- Specify the data schema, giving column names ("id", "text", and "label") and column types.
- Read data from a delimited file called "sms.csv".
- Print the schema for the resulting DataFrame.

In [9]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

# Load data from a delimited file
sms = spark.read.csv(path+'sms.csv', sep=';', header=False, schema=schema)

# Print schema of DataFrame
sms.printSchema()

root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = true)



---
<a id='class'></a>

## Classification

## Data Preparation

<img src="images/spark5_020.png" alt="" style="width: 800px;"/>

<img src="images/spark5_021.png" alt="" style="width: 800px;"/>

<img src="images/spark5_022.png" alt="" style="width: 800px;"/>

<img src="images/spark5_023.png" alt="" style="width: 800px;"/>

<img src="images/spark5_024.png" alt="" style="width: 800px;"/>

<img src="images/spark5_025.png" alt="" style="width: 800px;"/>

<img src="images/spark5_026.png" alt="" style="width: 800px;"/>

## Removing columns and rows

You previously loaded airline flight data from a CSV file. You're going to develop a model which will predict whether or not a given flight will be delayed.

In this exercise you need to trim those data down by:

- removing an uninformative column and
- removing rows which do not have information about whether or not a flight was delayed.

Instructions

- Remove the flight column.
- Find out how many records have missing values in the delay column.
- Remove records with missing values in the delay column.
- Remove records with missing values in any column and get the number of remaining rows.

In [10]:
# Remove the 'flight' column
flights = flights.drop('flight')

# Number of records with missing 'delay' values (wrap in print() to display the count)
flights.filter('delay IS NULL').count()

# Remove records with missing 'delay' values
flights = flights.filter('delay IS NOT NULL')

# Remove records with missing values in any column and get the number of remaining rows
flights = flights.dropna()
print(flights.count())

258289


## Column manipulation

The Federal Aviation Administration (FAA) considers a flight to be "delayed" when it arrives 15 minutes or more after its scheduled time.

The next step of preparing the flight data has two parts:

1. convert the units of distance, replacing the mile column with a km column; and
1. create a Boolean column indicating whether or not a flight was delayed.

Instructions

- Import a function which will allow you to round a number to a specific number of decimal places.
- Derive a new km column from the mile column, rounding to zero decimal places. One mile is 1.60934 km.
- Remove the mile column.
- Create a label column with a value of 1 indicating the delay was 15 minutes or more and 0 otherwise.
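
Before running the Spark version, the arithmetic can be checked by hand in plain Python. The (mile, delay) pairs below come from the first records displayed earlier. The constant names are chosen for this sketch only. Python's built-in `round` sends exact halves to the nearest even number whereas Spark's `round` rounds them up, but no value here lands on an exact half, so the two agree.

```python
KM_PER_MILE = 1.60934
DELAY_THRESHOLD = 15  # minutes

# (mile, delay) pairs from the first records with a known delay.
samples = [(157, 27), (738, -19), (2248, 60), (386, 22)]

# Convert distance to kilometres, rounded to zero decimal places.
km = [round(mile * KM_PER_MILE, 0) for mile, _ in samples]

# 1 if the flight was delayed by 15 minutes or more, otherwise 0.
label = [int(delay >= DELAY_THRESHOLD) for _, delay in samples]

print(km)     # [253.0, 1188.0, 3618.0, 621.0]
print(label)  # [1, 0, 1, 1]
```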

In [11]:
# Import the required function (note: this shadows Python's built-in round)
from pyspark.sql.functions import round

# Convert 'mile' to 'km' and drop 'mile' column
flights_km = flights.withColumn('km', round(flights.mile * 1.60934, 0)) \
                    .drop('mile')

# Create 'label' column indicating whether flight delayed (1) or not (0)
flights_km = flights_km.withColumn('label', (flights_km.delay >= 15).cast('integer'))

# Check first five records
flights_km.show(5)

+---+---+---+-------+---+------+--------+-----+------+-----+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|
+---+---+---+-------+---+------+--------+-----+------+-----+
| 10| 10|  1|     OO|ORD|  8.18|      51|   27| 253.0|    1|
| 11| 22|  1|     OO|ORD|  7.17|     127|  -19|1188.0|    0|
|  2| 14|  5|     B6|JFK| 21.17|     365|   60|3618.0|    1|
|  5| 25|  3|     WN|SJC| 12.92|      85|   22| 621.0|    1|
|  3| 28|  1|     B6|LGA| 13.33|     182|   70|1732.0|    1|
+---+---+---+-------+---+------+--------+-----+------+-----+
only showing top 5 rows



## Categorical columns

In the flights data there are two columns, carrier and org, which hold categorical data. You need to transform those columns into indexed numerical values.

- Import the appropriate class and create an indexer object to transform the carrier column from a string to a numeric index.
- Prepare the indexer object on the flight data.
- Use the prepared indexer to create the numeric index column.
- Repeat the process for the org column.
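
By default, `StringIndexer` orders categories by descending frequency, so the most common value receives index 0.0. The plain-Python sketch below imitates that idea with `collections.Counter`. It is an illustration rather than Spark's implementation. The carrier list is made up and has no tied counts. `frequency_index` is a hypothetical helper.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct value to an index, most frequent value first."""
    ranked = [value for value, _ in Counter(values).most_common()]
    return {value: float(position) for position, value in enumerate(ranked)}

# Made-up carrier column with no tied counts.
carriers = ["UA", "AA", "UA", "OO", "UA", "AA"]

mapping = frequency_index(carriers)              # plays the role of fit()
indexed = [mapping[c] for c in carriers]         # plays the role of transform()

print(mapping)  # {'UA': 0.0, 'AA': 1.0, 'OO': 2.0}
print(indexed)  # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```

The two steps mirror the exercise: preparing the indexer learns the mapping from the data, and applying it adds the numeric column.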

In [17]:
from pyspark.ml.feature import StringIndexer

# Create an indexer
indexer = StringIndexer(inputCol='carrier', outputCol='carrier_idx')

# Indexer identifies categories in the data
indexer_model = indexer.fit(flights_km)

# Indexer creates a new column with numeric index values
flights_indexed = indexer_model.transform(flights_km)

# Repeat the process for the other categorical feature
flights_indexed = StringIndexer(inputCol='org', outputCol='org_idx').fit(flights_indexed).transform(flights_indexed)

In [18]:
flights_indexed.show(10)

+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
| 10| 10|  1|     OO|ORD|  8.18|      51|   27| 253.0|    1|        2.0|    0.0|
| 11| 22|  1|     OO|ORD|  7.17|     127|  -19|1188.0|    0|        2.0|    0.0|
|  2| 14|  5|     B6|JFK| 21.17|     365|   60|3618.0|    1|        4.0|    2.0|
|  5| 25|  3|     WN|SJC| 12.92|      85|   22| 621.0|    1|        3.0|    5.0|
|  3| 28|  1|     B6|LGA| 13.33|     182|   70|1732.0|    1|        4.0|    3.0|
|  5| 28|  6|     B6|ORD|  9.58|     130|   47|1191.0|    1|        4.0|    0.0|
|  1| 19|  2|     UA|SFO| 12.75|     123|  135|1093.0|    1|        0.0|    1.0|
|  8|  5|  5|     US|LGA|  13.0|      71|  -10| 344.0|    0|        6.0|    3.0|
|  5| 27|  5|     AA|ORD| 14.42|     195|  -11|1926.0|    0|        1.0|    0.0|
|  8| 20|  6|     B6|JFK| 14

## Assembling columns

The final stage of data preparation is to consolidate all of the predictor columns into a single column.

At present our data has the following predictor columns:

- mon, dom and dow
- carrier_idx (derived from carrier)
- org_idx (derived from org)
- km
- depart
- duration

Instructions

- Import the class which will assemble the predictors.
- Create an assembler object that will allow you to merge the predictor columns into a single column.
- Use the assembler to generate a new consolidated column.
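
Conceptually, assembling means picking the predictor columns in a fixed order and packing their values into one numeric vector per row. A plain-Python sketch of that idea, using the first record of the indexed flights table, might look as follows. `assemble` is a hypothetical helper, not a Spark function.

```python
PREDICTORS = ["mon", "dom", "dow", "carrier_idx", "org_idx",
              "km", "depart", "duration"]

# First record from the indexed flights table.
row = {"mon": 10, "dom": 10, "dow": 1, "carrier": "OO", "org": "ORD",
       "depart": 8.18, "duration": 51, "delay": 27, "km": 253.0,
       "label": 1, "carrier_idx": 2.0, "org_idx": 0.0}

def assemble(record, input_cols):
    """Collect the chosen columns, in order, into one list of floats."""
    return [float(record[col]) for col in input_cols]

features = assemble(row, PREDICTORS)
print(features)  # [10.0, 10.0, 1.0, 2.0, 0.0, 253.0, 8.18, 51.0]
```

Columns that are not listed, such as delay and label, are left out of the vector. This matters because they carry the outcome the model is meant to predict.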

In [19]:
# Import the necessary class
from pyspark.ml.feature import VectorAssembler

# Create an assembler object
assembler = VectorAssembler(inputCols=['mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration'], outputCol='features')

# Consolidate predictor columns
flights_assembled = assembler.transform(flights_indexed)

# Check the resulting column
flights_assembled.select('features', 'delay').show(5, truncate=False)

+-----------------------------------------+-----+
|features                                 |delay|
+-----------------------------------------+-----+
|[10.0,10.0,1.0,2.0,0.0,253.0,8.18,51.0]  |27   |
|[11.0,22.0,1.0,2.0,0.0,1188.0,7.17,127.0]|-19  |
|[2.0,14.0,5.0,4.0,2.0,3618.0,21.17,365.0]|60   |
|[5.0,25.0,3.0,3.0,5.0,621.0,12.92,85.0]  |22   |
|[3.0,28.0,1.0,4.0,3.0,1732.0,13.33,182.0]|70   |
+-----------------------------------------+-----+
only showing top 5 rows



## Decision Tree

<img src="images/spark5_027.png" alt="" style="width: 800px;"/>

<img src="images/spark5_028.png" alt="" style="width: 800px;"/>

<img src="images/spark5_029.png" alt="" style="width: 800px;"/>

<img src="images/spark5_030.png" alt="" style="width: 800px;"/>

<img src="images/spark5_031.png" alt="" style="width: 800px;"/>

<img src="images/spark5_032.png" alt="" style="width: 800px;"/>
