# Getting started with machine learning pipelines

PySpark has built-in, cutting-edge machine learning routines, along with utilities to create full machine learning pipelines. You'll learn about them in this chapter.

## Preparing the environment

### Importing libraries

In [1]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.sql.types import (_parse_datatype_string, StructType, StructField,
                               DoubleType, IntegerType, StringType)
from pyspark.sql import SparkSession

### Connect to Spark

In [2]:
spark = SparkSession.builder.getOrCreate()

# Render DataFrames eagerly in the notebook, without calling .show()
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)

### Reading the data

In [3]:
schema_str = "year int, month int, day int, dep_time int, dep_delay int, arr_time int, " + \
             "arr_delay int, carrier string, tailnum string, flight int, origin string, " + \
             "dest string, air_time int, distance int, hour int, minute int"
customSchema = _parse_datatype_string(schema_str)
flights = spark.read.csv('data-sources/flights_small.csv', header=True, schema=customSchema)
flights.createOrReplaceTempView("flights")
flights.printSchema()
flights.limit(2)

root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: integer (nullable = true)
 |-- dep_delay: integer (nullable = true)
 |-- arr_time: integer (nullable = true)
 |-- arr_delay: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- tailnum: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- minute: integer (nullable = true)



year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,tailnum,flight,origin,dest,air_time,distance,hour,minute
2014,12,8,658,-7,935,-5,VX,N846VA,1780,SEA,LAX,132,954,6,58
2014,1,22,1040,5,1505,5,AS,N559AS,851,SEA,HNL,360,2677,10,40


In [4]:
schema_str = "faa string, name string, lat double, lon double, alt int, tz int, dst string"
customSchema = _parse_datatype_string(schema_str)
airports = spark.read.schema(customSchema).csv('data-sources/airports.csv', header=True)
airports.createOrReplaceTempView("airports")
airports.printSchema()
airports.limit(2)

root
 |-- faa: string (nullable = true)
 |-- name: string (nullable = true)
 |-- lat: double (nullable = true)
 |-- lon: double (nullable = true)
 |-- alt: integer (nullable = true)
 |-- tz: integer (nullable = true)
 |-- dst: string (nullable = true)



faa,name,lat,lon,alt,tz,dst
04G,Lansdowne Airport,41.1304722,-80.6195833,1044,-5,A
06A,Moton Field Munic...,32.4605722,-85.6800278,264,-5,A


In [5]:
customSchema = StructType([
    StructField("tailnum", StringType()),
    StructField("year", IntegerType()),
    StructField("type", StringType()),
    StructField("manufacturer", StringType()),
    StructField("model", StringType()),
    StructField("engines", IntegerType()),
    StructField("seats", IntegerType()),
    StructField("speed", DoubleType()),
    StructField("engine", StringType())
])
planes = (spark.read.schema(customSchema)
                    .format("csv")
                    .option("header", "true")
                    .load('data-sources/planes.csv'))
planes.createOrReplaceTempView("planes")
planes.printSchema()
planes.limit(2)

root
 |-- tailnum: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- engines: integer (nullable = true)
 |-- seats: integer (nullable = true)
 |-- speed: double (nullable = true)
 |-- engine: string (nullable = true)



tailnum,year,type,manufacturer,model,engines,seats,speed,engine
N102UW,1998,Fixed wing multi ...,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
N103US,1999,Fixed wing multi ...,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan


In [6]:
spark.catalog.listTables()

[Table(name='airports', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='flights', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='planes', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

## Machine Learning Pipelines

In the next two chapters you'll step through every stage of the machine learning pipeline, from data intake to model evaluation. Let's get to it!

At the core of the pyspark.ml module are the Transformer and Estimator classes. Almost every other class in the module behaves similarly to these two basic classes.

Transformer classes have a `.transform()` method that takes a DataFrame and returns a new DataFrame, usually the original one with a new column appended. For example, you might use the class `Bucketizer` to create discrete bins from a continuous feature, or a fitted `PCAModel` to reduce the dimensionality of your dataset using principal component analysis.

Estimator classes all implement a `.fit()` method. These methods also take a DataFrame, but instead of returning another DataFrame they return a model object. This can be something like a `StringIndexerModel` for including categorical data saved as strings in your models, or a `RandomForestClassificationModel` that uses the random forest algorithm for classification.
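
The contract between the two class families can be sketched in plain Python. This is only an illustration: `ToyIndexer` and `ToyIndexerModel` are hypothetical names, rows are plain dictionaries rather than a Spark DataFrame, and nothing here is part of the `pyspark.ml` API.

```python
class ToyIndexerModel:
    """Transformer: holds a learned mapping and appends a new column."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        # Return new rows; the input rows are left untouched.
        return [{**row, self.output_col: self.mapping[row[self.input_col]]}
                for row in rows]


class ToyIndexer:
    """Estimator: learns something from the data and returns a model."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        # Learn a label -> number mapping (alphabetical order, for simplicity).
        labels = sorted({row[self.input_col] for row in rows})
        mapping = {label: float(i) for i, label in enumerate(labels)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


rows = [{"carrier": "VX"}, {"carrier": "AS"}, {"carrier": "VX"}]
model = ToyIndexer("carrier", "carrier_index").fit(rows)  # Estimator -> model
indexed = model.transform(rows)                            # model -> new rows
print(indexed)
```

The point of the split is that whatever is learned in `.fit()` is frozen inside the model, so the same mapping can later be applied to other data.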

## Ex. 1 - Join the DataFrames

In the next two chapters you'll be working to build a model that predicts whether or not a flight will be delayed based on the `flights` data we've been working with. This model will also include information about the plane that flew the route, so the first step is to join the two tables: `flights` and `planes`!

**Instructions:**

1. First, rename the `year` column of `planes` to `plane_year` to avoid duplicate column names.
2. Create a new DataFrame called `model_data` by joining the `flights` table with `planes` using the `tailnum` column as the key.

In [7]:
# Rename year column
planes = planes.withColumnRenamed('year', 'plane_year')

# Join the DataFrames
model_data = flights.join(planes, on='tailnum', how="leftouter")
model_data.limit(2)

tailnum,year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,flight,origin,dest,air_time,distance,hour,minute,plane_year,type,manufacturer,model,engines,seats,speed,engine
N846VA,2014,12,8,658,-7,935,-5,VX,1780,SEA,LAX,132,954,6,58,2011,Fixed wing multi ...,AIRBUS,A320-214,2,182,,Turbo-fan
N559AS,2014,1,22,1040,5,1505,5,AS,851,SEA,HNL,360,2677,10,40,2006,Fixed wing multi ...,BOEING,737-890,2,149,,Turbo-fan
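
A left outer join keeps every row of the left table and fills the right-hand columns with nulls when no match exists. The plain-Python sketch below illustrates that behaviour with made-up rows (the tail number `N000XX` is hypothetical); it is not how Spark executes a join.

```python
def left_outer_join(left, right, key, right_cols):
    """Keep every left row; fill right-hand columns with None when unmatched."""
    lookup = {row[key]: row for row in right}  # assumes unique keys on the right
    joined = []
    for row in left:
        match = lookup.get(row[key])
        extra = {col: (match[col] if match else None) for col in right_cols}
        joined.append({**row, **extra})
    return joined


flights_rows = [
    {"tailnum": "N846VA", "year": 2014},
    {"tailnum": "N000XX", "year": 2014},  # hypothetical plane with no record
]
planes_rows = [{"tailnum": "N846VA", "plane_year": 2011}]

joined = left_outer_join(flights_rows, planes_rows, "tailnum", ["plane_year"])
for row in joined:
    print(row)
```

Flights without a matching plane end up with a null `plane_year`, which is why the missing-value filter in a later exercise also checks that column.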


## Data types

Before you get started modeling, it's important to know that Spark's machine learning routines only handle numeric data. That means every column you feed to a model must hold either integers or decimals (called 'doubles' in Spark).

When you read a file with `inferSchema=True`, Spark guesses what kind of information each column holds. Unfortunately, Spark doesn't always guess right: as the next exercise shows, some columns can come back as strings containing numbers rather than actual numeric values.

To remedy this, you can use the `.cast()` method in combination with the `.withColumn()` method. It's important to note that `.cast()` works on columns, while `.withColumn()` works on DataFrames.

The only argument you need to pass to `.cast()` is the kind of value you want to create, in string form. For example, to create integers, you'll pass the argument `"integer"` and for decimal numbers you'll use `"double"`.

You can put this call to `.cast()` inside a call to `.withColumn()` to overwrite the already existing column.

## Ex. 2 - String to integer

Now you'll use the `.cast()` method introduced in the previous section to convert all the appropriate columns of your DataFrame `flights_temp` to integers!

To convert the type of a column using the `.cast()` method, you can write code like this:

`dataframe = dataframe.withColumn("col", dataframe.col.cast("new_type"))`

**Instructions:**
1. Use the method `.withColumn()` to `.cast()` the following columns to type `"integer"`. Access the columns using the `df.col` notation:
    - `flights_temp.dep_time`
    - `flights_temp.dep_delay`
    - `flights_temp.arr_time`
    - `flights_temp.arr_delay`
    - `flights_temp.air_time`
    - `flights_temp.hour`
    - `flights_temp.minute`

In [8]:
flights_temp = spark.read.csv('data-sources/flights_small.csv', header=True, inferSchema=True)
flights_temp.createOrReplaceTempView("flights")
flights_temp.printSchema()
flights_temp.limit(2)

root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- tailnum: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: string (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)



year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,tailnum,flight,origin,dest,air_time,distance,hour,minute
2014,12,8,658,-7,935,-5,VX,N846VA,1780,SEA,LAX,132,954,6,58
2014,1,22,1040,5,1505,5,AS,N559AS,851,SEA,HNL,360,2677,10,40


In [9]:
# Cast the columns to integers
flights_temp = flights_temp.withColumn("dep_time", flights_temp.dep_time.cast('integer'))
flights_temp = flights_temp.withColumn("dep_delay", flights_temp.dep_delay.cast('integer'))
flights_temp = flights_temp.withColumn("arr_time", flights_temp.arr_time.cast('integer'))
flights_temp = flights_temp.withColumn("arr_delay", flights_temp.arr_delay.cast('integer'))
flights_temp = flights_temp.withColumn("air_time", flights_temp.air_time.cast('integer'))
flights_temp = flights_temp.withColumn("hour", flights_temp.hour.cast('integer'))
flights_temp = flights_temp.withColumn("minute", flights_temp.minute.cast('integer'))

flights_temp.printSchema()
flights_temp.limit(2)

root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: integer (nullable = true)
 |-- dep_delay: integer (nullable = true)
 |-- arr_time: integer (nullable = true)
 |-- arr_delay: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- tailnum: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- minute: integer (nullable = true)



year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,tailnum,flight,origin,dest,air_time,distance,hour,minute
2014,12,8,658,-7,935,-5,VX,N846VA,1780,SEA,LAX,132,954,6,58
2014,1,22,1040,5,1505,5,AS,N559AS,851,SEA,HNL,360,2677,10,40



## Ex. 3 - Create a new column

The column `plane_year` holds the year each plane was manufactured. However, your model will use the plane's age at the time of the flight, which is not the same as the year it was made!

**Instructions:**
1. Create the column `plane_age` using the `.withColumn()` method and subtracting the year of manufacture (column `plane_year`) from the year (column `year`) of the flight.

In [11]:
model_data.limit(2)

tailnum,year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,flight,origin,dest,air_time,distance,hour,minute,plane_year,type,manufacturer,model,engines,seats,speed,engine
N846VA,2014,12,8,658,-7,935,-5,VX,1780,SEA,LAX,132,954,6,58,2011,Fixed wing multi ...,AIRBUS,A320-214,2,182,,Turbo-fan
N559AS,2014,1,22,1040,5,1505,5,AS,851,SEA,HNL,360,2677,10,40,2006,Fixed wing multi ...,BOEING,737-890,2,149,,Turbo-fan


In [12]:
# Create the column plane_age
model_data = model_data.withColumn("plane_age", model_data.year - model_data.plane_year)
model_data.limit(2)

tailnum,year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,flight,origin,dest,air_time,distance,hour,minute,plane_year,type,manufacturer,model,engines,seats,speed,engine,plane_age
N846VA,2014,12,8,658,-7,935,-5,VX,1780,SEA,LAX,132,954,6,58,2011,Fixed wing multi ...,AIRBUS,A320-214,2,182,,Turbo-fan,3
N559AS,2014,1,22,1040,5,1505,5,AS,851,SEA,HNL,360,2677,10,40,2006,Fixed wing multi ...,BOEING,737-890,2,149,,Turbo-fan,8


## Ex. 4 - Making a Boolean

Consider that you're modeling a yes or no question: is the flight late? However, your data contains the arrival delay in minutes for each flight. Thus, you'll need to create a boolean column which indicates whether the flight was late or not!

**Instructions:**

1. Use the `.withColumn()` method to create the column `is_late`. This column is equal to `model_data.arr_delay > 0`.
2. Convert this column to an integer column named `label` so that you can use it in your model (`label` is the default name for the response variable in Spark's machine learning routines).
3. Filter out missing values.
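
Step 3 matters because a comparison with a missing value is itself missing in Spark SQL: `NULL > 0` evaluates to `NULL`, not `False`. The sketch below mimics that three-valued logic in plain Python, using `None` for a null; the helper names are hypothetical.

```python
def is_late(arr_delay):
    """Mimic SQL semantics: comparing a null yields a null."""
    return None if arr_delay is None else arr_delay > 0


def to_label(flag):
    """Mimic casting a nullable boolean to an integer."""
    return None if flag is None else int(flag)


arr_delays = [-5, 5, None]
labels = [to_label(is_late(d)) for d in arr_delays]
print(labels)

# Dropping rows with a missing delay leaves only usable labels.
clean = [to_label(is_late(d)) for d in arr_delays if d is not None]
print(clean)
```

A row whose label is missing cannot be used for training, so it is filtered out rather than silently treated as "not late".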

In [13]:
model_data.count()

10000

In [14]:
# Create is_late
model_data = model_data.withColumn("is_late", model_data.arr_delay > 0)
model_data.printSchema()

root
 |-- tailnum: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: integer (nullable = true)
 |-- dep_delay: integer (nullable = true)
 |-- arr_time: integer (nullable = true)
 |-- arr_delay: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- minute: integer (nullable = true)
 |-- plane_year: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- engines: integer (nullable = true)
 |-- seats: integer (nullable = true)
 |-- speed: double (nullable = true)
 |-- engine: string (nullable = true)
 |-- plane_age: integer (nullable = true)
 |-- is_late: boolean (nullable = true)

In [15]:
# Convert to an integer
model_data = model_data.withColumn("label", model_data.is_late.cast('integer'))

# Remove missing values
model_data = model_data.filter("arr_delay is not NULL and "
                               "dep_delay is not NULL and "
                               "air_time is not NULL and "
                               "plane_year is not NULL")
model_data.printSchema()
print('Total rows: ', model_data.count())
model_data.limit(2)

root
 |-- tailnum: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: integer (nullable = true)
 |-- dep_delay: integer (nullable = true)
 |-- arr_time: integer (nullable = true)
 |-- arr_delay: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- minute: integer (nullable = true)
 |-- plane_year: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- engines: integer (nullable = true)
 |-- seats: integer (nullable = true)
 |-- speed: double (nullable = true)
 |-- engine: string (nullable = true)
 |-- plane_age: integer (nullable = true)
 |-- is_late: boolean (nullable = true)
 |-- label: integer (nullable = true)

tailnum,year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,flight,origin,dest,air_time,distance,hour,minute,plane_year,type,manufacturer,model,engines,seats,speed,engine,plane_age,is_late,label
N846VA,2014,12,8,658,-7,935,-5,VX,1780,SEA,LAX,132,954,6,58,2011,Fixed wing multi ...,AIRBUS,A320-214,2,182,,Turbo-fan,3,False,0
N559AS,2014,1,22,1040,5,1505,5,AS,851,SEA,HNL,360,2677,10,40,2006,Fixed wing multi ...,BOEING,737-890,2,149,,Turbo-fan,8,True,1


## Strings and factors

As you know, Spark requires numeric data for modeling. So far this hasn't been an issue; even boolean columns can easily be converted to integers without any trouble. But you'll also be using the airline and the plane's destination as features in your model. These are coded as strings and there isn't any obvious way to convert them to a numeric data type.

Fortunately, PySpark has functions for handling this built into the `pyspark.ml.feature` submodule. You can create what are called "one-hot vectors" to represent the carrier and the destination of each flight. A one-hot vector is a way of representing a categorical feature where every observation has a vector in which all elements are zero except for at most one element, which has a value of one (1).

Each element in the vector corresponds to a level of the feature, so it's possible to tell what the right level is by seeing which element of the vector is equal to one (1).
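
As an illustration, a one-hot vector can be built in a few lines of plain Python. The `drop_last` flag mirrors the default `dropLast=True` behaviour of Spark's `OneHotEncoder`, where the final category is encoded as an all-zero vector; that is why the definition says "at most one" element is set. The function itself is a hypothetical helper, not Spark code.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:          # the dropped last category stays all zeros
        vector[index] = 1.0
    return vector


first = one_hot(0, 3)                    # first of three categories
last = one_hot(2, 3)                     # last category: all zeros
full = one_hot(2, 3, drop_last=False)    # keep a slot for every category
print(first, last, full)
```

Dropping the last slot loses no information, because an all-zero vector can only mean the remaining category.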

The first step to encoding your categorical feature is to create a `StringIndexer`. Members of this class are Estimators that take a DataFrame with a column of strings and map each unique string to a number. Then, the Estimator returns a Transformer that takes a DataFrame, attaches the mapping to it as metadata, and returns a new DataFrame with a numeric column corresponding to the string column.
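
By default `StringIndexer` orders labels by descending frequency, so the most common string receives index 0. A rough plain-Python equivalent is sketched below; `fit_string_index` is a hypothetical helper, and the alphabetical tie-break is an assumption about recent Spark versions.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


carriers = ["AS", "VX", "AS", "DL", "AS", "VX"]
mapping = fit_string_index(carriers)
indexed = [mapping[c] for c in carriers]
print(mapping)
print(indexed)
```

Building the mapping corresponds to the Estimator's `.fit()` step, and applying it to each row corresponds to the resulting Transformer's `.transform()` step.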

The second step is to encode this numeric column as a one-hot vector using a `OneHotEncoder`. This works the same way as the `StringIndexer`: you create an Estimator, which is then fitted to produce a Transformer. The end result is a column that encodes your categorical feature as a vector that's suitable for machine learning routines!

All you have to remember is that you need to create a `StringIndexer` and a `OneHotEncoder`, and the Pipeline will take care of the rest.

## Ex. 5 - Carrier

In this exercise you'll create a `StringIndexer` and a `OneHotEncoder` to encode the `carrier` column. To do this, you'll call the class constructors with the arguments `inputCol` and `outputCol`.

The `inputCol` is the name of the column you want to index or encode, and the `outputCol` is the name of the new column that the Transformer should create.

**Instructions:**

1. Create a `StringIndexer` called `carr_indexer` by calling `StringIndexer()` with `inputCol="carrier"` and `outputCol="carrier_index"`.
2. Create a `OneHotEncoder` called `carr_encoder` by calling `OneHotEncoder()` with `inputCol="carrier_index"` and `outputCol="carrier_fact"`.

In [16]:
model_data.limit(2)

tailnum,year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,flight,origin,dest,air_time,distance,hour,minute,plane_year,type,manufacturer,model,engines,seats,speed,engine,plane_age,is_late,label
N846VA,2014,12,8,658,-7,935,-5,VX,1780,SEA,LAX,132,954,6,58,2011,Fixed wing multi ...,AIRBUS,A320-214,2,182,,Turbo-fan,3,False,0
N559AS,2014,1,22,1040,5,1505,5,AS,851,SEA,HNL,360,2677,10,40,2006,Fixed wing multi ...,BOEING,737-890,2,149,,Turbo-fan,8,True,1


In [17]:
# Create a StringIndexer
carr_indexer = StringIndexer(inputCol="carrier", outputCol="carrier_index")
carr_indexer

StringIndexer_316ff4f0b606

In [18]:
# Create a OneHotEncoder
carr_encoder = OneHotEncoder(inputCol="carrier_index", outputCol="carrier_fact")
carr_encoder

OneHotEncoder_f468bc57942a

## Ex. 6 - Destination

Now you'll encode the `dest` column just like you did in the previous exercise.

**Instructions:**

1. Create a `StringIndexer` called `dest_indexer` by calling `StringIndexer()` with `inputCol="dest"` and `outputCol="dest_index"`.
2. Create a `OneHotEncoder` called `dest_encoder` by calling `OneHotEncoder()` with `inputCol="dest_index"` and `outputCol="dest_fact"`.

In [19]:
# Create a StringIndexer
dest_indexer = StringIndexer(inputCol="dest", outputCol="dest_index")
dest_indexer

StringIndexer_e2b0e6a917f8

In [20]:
# Create a OneHotEncoder
dest_encoder = OneHotEncoder(inputCol="dest_index", outputCol="dest_fact")
dest_encoder

OneHotEncoder_5879e791ce6e

## Ex. 7 - Assemble a vector

The last step in the Pipeline is to combine all of the columns containing our features into a single column. This has to be done before modeling can take place because every Spark modeling routine expects the data to be in this form. You can do this by storing each of the values from a column as an entry in a vector. Then, from the model's point of view, every observation is a vector that contains all of the information about it and a label that tells the modeler what value that observation corresponds to.

Because of this, the `pyspark.ml.feature` submodule contains a class called `VectorAssembler`. This Transformer takes all of the columns you specify and combines them into a new vector column.
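
Conceptually, assembling a vector means concatenating scalar columns and vector columns into one flat vector. The sketch below does this in plain Python for the first flight shown in this chapter. The vector widths (10 for the carrier, 68 for the destination) and the index values are taken from the pipeline output later in the chapter, and `assemble` and `one_hot` are hypothetical helpers rather than Spark's implementation.

```python
def assemble(row, input_cols):
    """Concatenate scalars and lists from a row into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, list):
            features.extend(value)          # vector column: append every element
        else:
            features.append(float(value))   # scalar column: append one element
    return features


def one_hot(index, size):
    """Dense one-hot list with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


row = {
    "month": 12,
    "air_time": 132,
    "carrier_fact": one_hot(7, 10),  # carrier_index 7.0 in a 10-slot vector
    "dest_fact": one_hot(1, 68),     # dest_index 1.0 in a 68-slot vector
    "plane_age": 3,
}
features = assemble(row, ["month", "air_time", "carrier_fact", "dest_fact", "plane_age"])
nonzero = [i for i, v in enumerate(features) if v != 0.0]
print(len(features), nonzero)
```

The length (1 + 1 + 10 + 68 + 1 = 81) and the nonzero positions agree with the truncated sparse vector `(81,[0,1,9,13,80]...` that the fitted pipeline reports for this flight.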

**Instructions:**

1. Create a `VectorAssembler` by calling `VectorAssembler()` with the `inputCols` names as a list and the `outputCol` name `"features"`.
2. The list of columns should be `["month", "air_time", "carrier_fact", "dest_fact", "plane_age"]`.

In [21]:
# Make a VectorAssembler
vec_assembler = VectorAssembler(
    inputCols=["month", "air_time", "carrier_fact", "dest_fact", "plane_age"], 
    outputCol='features'
)
vec_assembler

VectorAssembler_34847e9092cd

## Ex. 8 - Create the pipeline

`Pipeline` is a class in the `pyspark.ml` module that combines all the Estimators and Transformers that you've already created. This lets you reuse the same modeling process over and over again by wrapping it up in one simple object.
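
What fitting a pipeline does can be sketched in plain Python: each Estimator stage is fitted on the data as transformed so far, and the resulting models are chained. The classes below (`ToyPipeline`, `MeanCenterer`, `Doubler`, and their models) are hypothetical stand-ins operating on lists of numbers, not `pyspark.ml` classes.

```python
class Doubler:
    """Transformer stage: no fitting needed."""

    def transform(self, data):
        return [2 * x for x in data]


class MeanCentererModel:
    """Fitted model: subtracts the mean learned during fit."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


class MeanCenterer:
    """Estimator stage: learns the mean and returns a fitted model."""

    def fit(self, data):
        return MeanCentererModel(sum(data) / len(data))


class ToyPipelineModel:
    """Chain of fitted stages, applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; Transformers are used as they are.
            model = stage.fit(data) if hasattr(stage, "fit") else stage
            data = model.transform(data)  # feed the result to the next stage
            fitted.append(model)
        return ToyPipelineModel(fitted)


pipe_model = ToyPipeline(stages=[Doubler(), MeanCenterer()]).fit([1, 2, 3])
same_data = pipe_model.transform([1, 2, 3])
new_data = pipe_model.transform([10])
print(same_data, new_data)
```

Because the fitted pipeline remembers what each stage learned, the same processing can be applied to new data without refitting.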

**Instructions:**

1. Import `Pipeline` from `pyspark.ml`. (Already done!)
2. Call the `Pipeline()` constructor with the keyword argument `stages` to create a Pipeline called `flights_pipe`. `stages` should be a list holding all the stages you want your data to go through in the pipeline. Here this is just: `[dest_indexer, dest_encoder, carr_indexer, carr_encoder, vec_assembler]`.

In [22]:
# Make the pipeline
flights_pipe = Pipeline(
    stages=[dest_indexer, dest_encoder, carr_indexer, carr_encoder, vec_assembler]
)
flights_pipe

Pipeline_616aeb233b4f

## Test vs. Train

After you've cleaned your data and gotten it ready for modeling, one of the most important steps is to split the data into a test set and a train set. After that, don't touch your test data until you think you have a good model! As you're building models and forming hypotheses, you can test them on your training data to get an idea of their performance.

Once you've got your favorite model, you can see how well it predicts the new data in your test set. This never-before-seen data will give you a much more realistic idea of your model's performance in the real world when you're trying to predict or classify new data.

In Spark it's important to make sure you split the data after all the transformations. This is because operations like `StringIndexer` don't always produce the same index even when given the same list of strings.
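
One way the indices can differ is when an indexer is fitted on different subsets of the data: label frequencies change, and so does the mapping. The plain-Python sketch below shows the effect with a frequency-ordered index; the helper is hypothetical and the carrier lists are made up.

```python
from collections import Counter


def fit_string_index(values):
    """Most frequent label gets index 0; ties are broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


subset_a = ["AS", "AS", "VX"]
subset_b = ["VX", "VX", "AS"]

index_a = fit_string_index(subset_a)
index_b = fit_string_index(subset_b)
print(index_a)
print(index_b)
```

The two subsets assign opposite indices to the same carriers, whereas fitting once on the full dataset and splitting afterwards keeps a single, consistent mapping.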

## Ex. 9 - Transform the data

Hooray, now you're finally ready to pass your data through the Pipeline you created!

**Instructions:**
1. Create the DataFrame `piped_data` by calling the Pipeline methods `.fit()` and `.transform()` in a chain. Both of these methods take `model_data` as their only argument.

In [23]:
# Fit and transform the data
piped_data = flights_pipe.fit(model_data).transform(model_data)
piped_data.limit(2)

tailnum,year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,flight,origin,dest,air_time,distance,hour,minute,plane_year,type,manufacturer,model,engines,seats,speed,engine,plane_age,is_late,label,dest_index,dest_fact,carrier_index,carrier_fact,features
N846VA,2014,12,8,658,-7,935,-5,VX,1780,SEA,LAX,132,954,6,58,2011,Fixed wing multi ...,AIRBUS,A320-214,2,182,,Turbo-fan,3,False,0,1.0,"(68,[1],[1.0])",7.0,"(10,[7],[1.0])","(81,[0,1,9,13,80]..."
N559AS,2014,1,22,1040,5,1505,5,AS,851,SEA,HNL,360,2677,10,40,2006,Fixed wing multi ...,BOEING,737-890,2,149,,Turbo-fan,8,True,1,19.0,"(68,[19],[1.0])",0.0,"(10,[0],[1.0])","(81,[0,1,2,31,80]..."


## Ex. 10 - Split the data

Now that you've done all your manipulations, the last step before modeling is to split the data!

**Instructions:**

1. Use the DataFrame method `.randomSplit()` to split `piped_data` into two pieces, `training` with `60%` of the data, and `test` with `40%` of the data by passing the list `[.6, .4]` to the `.randomSplit()` method.

In [24]:
# Split the data into training and test sets
training, test = piped_data.randomSplit([.6, .4])

print(f'Training set: {training.count()} rows.')
print(f'Testing set : {test.count()} rows.')

Training set: 5547 rows.
Testing set : 3756 rows.
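
The split sizes are not exactly 60% and 40% because each row is assigned to a split at random, so the weights act as probabilities rather than quotas. The plain-Python sketch below illustrates the idea (Spark's internal sampling differs in detail) and shows how fixing a seed makes a split repeatable; `random_split` is a hypothetical helper. `.randomSplit()` accepts an optional `seed` argument for the same purpose.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(10_000))
train, test = random_split(rows, 0.6, seed=42)
print(len(train), len(test))  # close to 6000 / 4000, but rarely exact

# The same seed reproduces the same split.
train_again, _ = random_split(rows, 0.6, seed=42)
print(train == train_again)
```

Without a seed, rerunning the notebook produces a different split each time, which makes results harder to compare across runs.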


## Close

In [25]:
spark.stop()