<a href="https://colab.research.google.com/github/pratikgujral/learn-PySpark/blob/master/Learn_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we'll learn how to use Spark from Python!

# Prerequisites
- Familiarity with Python


# Introduction
![Spark logo](https://spark.apache.org/docs/latest/api/python/_static/spark-logo-hd.png)

Spark is a tool for doing parallel computation with large datasets. Spark integrates with Python very well. PySpark is the Python package that makes the magic happen. You'll use this package to work with data about flights from Portland and Seattle. You'll learn to wrangle this data and build a whole machine learning pipeline to predict whether or not flights will be delayed. Get ready to put some Spark in your Python code and dive into the world of high-performance machine learning!

# Part 1: Getting to know PySpark
In this part, we'll learn how Spark manages data and how you can read and write tables from Python.

## What is Spark?
Spark is a platform for **cluster computing**. Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.

![Cluster Computing](https://i.pinimg.com/originals/bf/c4/17/bfc4173a6e383fb815935fad9a8d9c11.png)

As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster. Parallel computation can make certain types of programming tasks much faster.
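
To make the split-and-combine idea concrete, here is a minimal pure-Python sketch. It does not use Spark: a thread pool stands in for the cluster, and the names `chunks` and `partial_sum`, as well as the choice of four "nodes", are illustrative assumptions only.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))   # the "large" dataset: the integers 1..100
n_nodes = 4                  # hypothetical number of worker nodes

# Split the data into one contiguous chunk per node
chunk_size = len(data) // n_nodes
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def partial_sum(chunk):
    """Work done by a single node on its own subset of the data."""
    return sum(chunk)


# Each "node" processes its chunk independently; the results are then combined
with ThreadPoolExecutor(max_workers=n_nodes) as pool:
    partial_results = list(pool.map(partial_sum, chunks))

total = sum(partial_results)
print(partial_results)
print(total)
```

Threads in CPython do not actually speed up CPU-bound work like this; the sketch only shows the pattern. Spark applies the same pattern across separate processes and machines.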

However, with greater computing power comes greater complexity.

Deciding whether or not Spark is the best solution for your problem takes some experience, but you can consider questions like:
 - Is my data too big to work with on a single machine?
 - Can my calculations be easily parallelized?

## Using Spark in Python
The first step in using Spark is connecting to a cluster.

In practice, the cluster is hosted on a remote machine that's connected to all the other nodes. One computer, called the **master** (sometimes also called a **frontend node**), manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called **workers**. The master sends the workers data and calculations to run, and they send their results back to the master.

When you're just getting started with Spark, it's simpler to run a cluster locally. Thus, in this notebook, instead of connecting to a remote cluster, all computations will run in a local Spark session on the machine that hosts the notebook.

Creating the connection is as simple as creating an instance of the `SparkContext` class. The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to.

An object holding all these attributes can be created with the `SparkConf()` constructor. Take a look at the documentation for all the details!

For the rest of this notebook, we won't need to manage a `SparkContext` directly: the `SparkSession` created in the setup below wraps one, available as `spark.sparkContext`.



## Installing and setting up Spark
Run the code below. If it throws an error, update the link to the Apache Spark distribution in the `wget` command, and update the matching path when setting the environment variables afterwards.

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

Now that we have installed Spark and Java in Colab, it is time to set the environment path that enables us to run PySpark in our Colab environment. Set the location of Java and Spark by running the following code:

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

Running a local Spark Session

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

## Reading the datasets

In [0]:
url_airports_dataset = r"https://assets.datacamp.com/production/repositories/1237/datasets/6e5c4ac2a4799338ba7e13d54ce1fa918da644ba/airports.csv"
url_flights_dataset = r"https://assets.datacamp.com/production/repositories/1237/datasets/fa47bb54e83abd422831cbd4f441bd30fd18bd15/flights_small.csv"
url_planes_datasets = r"https://assets.datacamp.com/production/repositories/1237/datasets/231480a2696c55fde829ce76d936596123f12c0c/planes.csv"

In [5]:
!wget --no-check-certificate "https://assets.datacamp.com/production/repositories/1237/datasets/6e5c4ac2a4799338ba7e13d54ce1fa918da644ba/airports.csv" -O /tmp/airports.csv
!wget --no-check-certificate "https://assets.datacamp.com/production/repositories/1237/datasets/fa47bb54e83abd422831cbd4f441bd30fd18bd15/flights_small.csv" -O /tmp/flights.csv
!wget --no-check-certificate "https://assets.datacamp.com/production/repositories/1237/datasets/231480a2696c55fde829ce76d936596123f12c0c/planes.csv" -O /tmp/planes.csv

--2020-02-10 10:09:53--  https://assets.datacamp.com/production/repositories/1237/datasets/6e5c4ac2a4799338ba7e13d54ce1fa918da644ba/airports.csv
Resolving assets.datacamp.com (assets.datacamp.com)... 99.86.33.55, 99.86.33.27, 99.86.33.91, ...
Connecting to assets.datacamp.com (assets.datacamp.com)|99.86.33.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84548 (83K)
Saving to: ‘/tmp/airports.csv’


2020-02-10 10:09:53 (7.24 MB/s) - ‘/tmp/airports.csv’ saved [84548/84548]

--2020-02-10 10:09:56--  https://assets.datacamp.com/production/repositories/1237/datasets/fa47bb54e83abd422831cbd4f441bd30fd18bd15/flights_small.csv
Resolving assets.datacamp.com (assets.datacamp.com)... 99.86.33.55, 99.86.33.27, 99.86.33.91, ...
Connecting to assets.datacamp.com (assets.datacamp.com)|99.86.33.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 614174 (600K)
Saving to: ‘/tmp/flights.csv’


2020-02-10 10:09:56 (17.3 MB/s) - ‘/tmp/flights.csv’ saved 

## Using DataFrames
Spark's core data structure is the **Resilient Distributed Dataset (RDD)**. This is a low-level object that lets Spark work its magic by splitting data across multiple nodes in the cluster. However, RDDs are hard to work with directly, so in this notebook we'll be using the Spark DataFrame abstraction built on top of RDDs.

The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). Not only are they easier to understand, DataFrames are also more optimized for complicated operations than RDDs.

When you start modifying and combining columns and rows of data, there are many ways to arrive at the same result, but some often take much longer than others. When using RDDs, it's up to the data scientist to figure out the right way to optimize the query, but the DataFrame implementation has much of this optimization built in!

To start working with Spark DataFrames, you first have to create a `SparkSession` object from your `SparkContext`. You can think of the `SparkContext` as your connection to the cluster and the `SparkSession` as your interface with that connection.

Remember, for the rest of the notebook, we'll have a `SparkSession` called `spark` available in our workspace!

## Creating a SparkSession
We'll have a `SparkSession` called `spark`, but what if you're not sure there already is one? Creating multiple SparkSessions and SparkContexts can cause issues, so it's best practice to use the `SparkSession.builder.getOrCreate()` method. This returns an existing `SparkSession` if there's already one in the environment, or creates a new one if necessary!

In [6]:
from pyspark.sql import SparkSession

# Creating a new PySpark Session
my_spark = SparkSession.builder.getOrCreate()
print(my_spark)

<pyspark.sql.session.SparkSession object at 0x7f2f88833630>


---

## Reading a CSV into a Spark DataFrame

In [7]:
flights = spark.read.format("csv").option("header", "true").load('/tmp/flights.csv')
print(type(flights))

<class 'pyspark.sql.dataframe.DataFrame'>


### Printing top `n` rows of the Spark DataFrame
We use the `.show()` method on the DataFrame to print the top `n` rows. By default, `n = 20`.


In [8]:
flights.show(n=22)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1

## Selecting data
### Selecting only specific column(s)

Passing the name of a column, or a list of column names, to the `.select()` method returns only the desired columns.


In [9]:
flights.select(['carrier', 'origin']).show()

+-------+------+
|carrier|origin|
+-------+------+
|     VX|   SEA|
|     AS|   SEA|
|     VX|   SEA|
|     WN|   PDX|
|     AS|   SEA|
|     WN|   PDX|
|     WN|   PDX|
|     VX|   SEA|
|     AS|   SEA|
|     AS|   SEA|
|     AS|   SEA|
|     AS|   SEA|
|     AS|   SEA|
|     AS|   SEA|
|     AS|   SEA|
|     UA|   PDX|
|     AS|   SEA|
|     WN|   SEA|
|     AS|   SEA|
|     OO|   PDX|
+-------+------+
only showing top 20 rows



### Selecting columns while applying a transformation
Selecting the `carrier`, `origin` and `air_time` columns, and adding 50 to every value in `air_time`.

In [10]:
flights.select(flights['carrier'], flights['origin'],flights['air_time'] + 50).show()

+-------+------+---------------+
|carrier|origin|(air_time + 50)|
+-------+------+---------------+
|     VX|   SEA|          182.0|
|     AS|   SEA|          410.0|
|     VX|   SEA|          161.0|
|     WN|   PDX|          133.0|
|     AS|   SEA|          177.0|
|     WN|   PDX|          171.0|
|     WN|   PDX|          140.0|
|     VX|   SEA|          148.0|
|     AS|   SEA|          185.0|
|     AS|   SEA|          248.0|
|     AS|   SEA|          180.0|
|     AS|   SEA|          204.0|
|     AS|   SEA|          177.0|
|     AS|   SEA|          233.0|
|     AS|   SEA|          179.0|
|     UA|   PDX|          140.0|
|     AS|   SEA|          126.0|
|     WN|   SEA|          266.0|
|     AS|   SEA|          340.0|
|     OO|   PDX|          161.0|
+-------+------+---------------+
only showing top 20 rows



### Selecting data based on some condition
We use the `.filter()` method on the Spark DataFrame and fetch the data based on the condition provided. 

In [11]:
flights.filter(flights['air_time'] > 200).show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    8| 11|    1017|       -3|    1613|       -7|     WN| N8634A|   827|   SEA| MDW|     216|    1733|  10|    17|
|2014|    1| 13|    2156|       -9|     607|      -15|     AS| N597AS|    24|   SEA| BOS|     290|    2496|  21|    56|
|2014|    9| 26|     610|       -5|    1523|       65|     US| N127UW|   616|   SEA| PHL|     293|    2378|   6|    10|
|2014|   12|  4|     954|       -6|    1348|      -17|     HA| N395HA|    29|   SEA| OGG|     333|    2640|   9|    54|
|2014|    6|  7|    1823|       -7|    2

In [12]:
flights.groupby('origin').count().show()

+------+-----+
|origin|count|
+------+-----+
|   SEA| 6754|
|   PDX| 3246|
+------+-----+



### Printing the schema

In [13]:
flights.printSchema()

root
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- day: string (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- tailnum: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: string (nullable = true)
 |-- distance: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)



## Creating Tables
The `.createDataFrame()` method takes a `pandas` DataFrame and returns a Spark DataFrame.

The output of this method is stored locally, not in the `SparkSession` `catalog`. This means that you can use all the Spark DataFrame methods on it, but you can't access the data in other contexts.

For example, a SQL query (using the `.sql()` method) that references your DataFrame will throw an error. To access the data in this way, you have to save it as a temporary table.

You can do this using the `.createTempView()` Spark DataFrame method, which takes as its only argument the name of the temporary table you'd like to register. This method registers the DataFrame as a table in the catalog, but as this table is temporary, it can only be accessed from the specific `SparkSession` used to create the Spark DataFrame.

There is also the method `.createOrReplaceTempView()`. This safely creates a new temporary table if nothing was there before, or updates an existing table if one was already defined. You'll use this method to avoid running into problems with duplicate tables.

Check out the diagram to see all the different ways your Spark data structures interact with each other.

![alt](https://s3.amazonaws.com/assets.datacamp.com/production/course_4452/datasets/spark_figure.png)

## Viewing Tables
Now that we have created a `SparkSession`, we can start poking around to see what data is in our cluster!

Our `SparkSession` has an attribute called `catalog` which lists all the data inside the cluster. This attribute has a few methods for extracting different pieces of information.

One of the most useful is the `.listTables()` method, which returns the names of all the tables in your cluster as a list.

In [14]:
print(spark.catalog.listTables()) # Output -> [] (no tables or views have been registered yet)

[]


This printed an empty list. Even though we have a variable holding a Spark DataFrame in memory, we have not yet registered any tables or views in the catalog.


We now create a temporary view from the `flights` Spark DataFrame and register it in the catalog under the name `'flights'`.

In [0]:
flights.createOrReplaceTempView("flights")

Now when we list the tables present in the catalog, we should get our newly created temporary table in the results.

In [16]:
print(spark.catalog.listTables()) # Output -> [Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

[Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]


### Are you query-ious?
One of the advantages of the DataFrame interface is that you can run SQL queries on the tables in your Spark cluster.

As we saw above, one of the tables in the catalog is the `flights` table. This table contains a row for every flight that left Portland International Airport (PDX) or Seattle-Tacoma International Airport (SEA) in 2014 and 2015.

Running a query on this table is as easy as using the `.sql()` method on your `SparkSession`. This method takes a string containing the query and returns a DataFrame with the results!

If you look closely, you'll notice that the table `flights` is only mentioned in the query, not as an argument to any of the methods. This is because there isn't a local object in your environment that holds that data, so it wouldn't make sense to pass the table as an argument.

Before that, we'll create a `SparkSession` called `spark`.

In [17]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Writing the query to fetch first 10 rows of flights
query = 'SELECT * FROM flights LIMIT 10'

# Executing the query using the .sql() method of SparkSession
flights10 = spark.sql(query)

# Using the DataFrame method .show() to print flights10
flights10.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1



---


### Panda-fying a Spark DataFrame

Suppose you've run a query on your huge dataset and aggregated it down to something a little more manageable.

Sometimes it makes sense to then take that table and work with it locally using a tool like `pandas`. Spark DataFrames make that easy with the `.toPandas()` method. Calling this method on a Spark DataFrame returns the corresponding `pandas` DataFrame. It's as simple as that!

This time the query counts the number of flights to each airport from SEA and PDX.

Of course, we shall first create a `SparkSession` called `spark` in our workspace!

In [18]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = 'SELECT origin, dest, COUNT(*) AS N FROM flights GROUP BY origin, dest'

# Running the query
flight_counts = spark.sql(query)

# Converting the resulting table into a pandas DataFrame
pd_counts = flight_counts.toPandas()

# Print the head of pd_counts
print(pd_counts.head())

  origin dest    N
0    SEA  RNO    8
1    SEA  DTW   98
2    SEA  CLE    2
3    SEA  LAX  450
4    PDX  SEA  144


---
### Creating a Spark DataFrame from a Pandas DataFrame
A Spark DataFrame can be created by calling the `spark.createDataFrame()` method with a `pandas` DataFrame as the argument.

In [19]:
from pyspark.sql import SparkSession
import numpy as np
import pandas as pd

# Creating a temporary pandas DataFrame
pd_temp = pd.DataFrame(np.random.random(10))
pd_temp

          0
0  0.352623
1  0.030013
2  0.192089
3  0.293860
4  0.075973
5  0.913636
6  0.996570
7  0.818788
8  0.013662
9  0.675787


In [20]:
spark = SparkSession.builder.getOrCreate()

# Creating a Spark DataFrame called spark_temp by calling the .createDataFrame() method with pd_temp as the argument
spark_temp = spark.createDataFrame(pd_temp)

# Examining the tables in the catalog
print(spark.catalog.listTables()) # OUTPUT -> [Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
# Note that it DOES NOT list spark_temp, as that DataFrame exists only locally and is not registered in the Spark catalog

# Registering spark_temp as a temporary table named 'temp' using the createOrReplaceTempView() method. The name of the table is set by passing the desired name as an argument to the method
spark_temp.createOrReplaceTempView('temp')

# Examining the list of tables once again. This time, it will list our newly created temp DataFrame as a table
print(spark.catalog.listTables()) 
# OUTPUT-> [Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True), Table(name='temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

[Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
[Table(name='flights', database=None, description=None, tableType='TEMPORARY', isTemporary=True), Table(name='temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]


---


### Dropping the middle man
Now you know how to put data into Spark via `pandas`, but you're probably wondering why deal with `pandas` at all? Wouldn't it be easier to just read a text file straight into Spark? Of course it would!

Luckily, your `SparkSession` has a `.read` attribute which has several methods for reading different data sources into Spark DataFrames. Using these you can create a DataFrame from a .csv file just like with regular pandas DataFrames!

The variable `file_path` is a string with the path to the file `airports.csv`. This file contains information about different airports all over the world.

In [21]:
spark = SparkSession.builder.getOrCreate()

file_path = '/tmp/airports.csv'

# Reading the CSV file into a Spark DataFrame
airports = spark.read.csv(file_path, header=True)
print(type(airports)) # Note the Datatype is <class 'pyspark.sql.dataframe.DataFrame'>

# Show the data
airports.show()

<class 'pyspark.sql.dataframe.DataFrame'>
+---+--------------------+----------------+-----------------+----+---+---+
|faa|                name|             lat|              lon| alt| tz|dst|
+---+--------------------+----------------+-----------------+----+---+---+
|04G|   Lansdowne Airport|      41.1304722|      -80.6195833|1044| -5|  A|
|06A|Moton Field Munic...|      32.4605722|      -85.6800278| 264| -5|  A|
|06C| Schaumburg Regional|      41.9893408|      -88.1012428| 801| -6|  A|
|06N|     Randall Airport|       41.431912|      -74.3915611| 523| -5|  A|
|09J|Jekyll Island Air...|      31.0744722|      -81.4277778|  11| -4|  A|
|0A9|Elizabethton Muni...|      36.3712222|      -82.1734167|1593| -4|  A|
|0G6|Williams County A...|      41.4673056|      -84.5067778| 730| -5|  A|
|0G7|Finger Lakes Regi...|      42.8835647|      -76.7812318| 492| -5|  A|
|0P2|Shoestring Aviati...|      39.7948244|      -76.6471914|1000| -5|  U|
|0S9|Jefferson County ...|      48.0538086|     -122.81064



---





# Part 2: Manipulating data
In this part, we'll learn about the `pyspark.sql` module, which provides optimized data queries to our Spark session.

## Creating columns
In this section, you'll learn how to use the methods defined by Spark's DataFrame class to perform common data operations.

Let's look at performing column-wise operations. In Spark you can do this using the `.withColumn()` method, which takes two arguments. 

> 1) A string with the name of your new column

> 2) The new column itself

The new column must be an object of class `Column`. Creating one of these is as easy as extracting a column from your DataFrame using `df.colName`.

Updating a Spark DataFrame is somewhat different from working in `pandas` because the **Spark DataFrame is immutable**. This means that it can't be changed, and so columns can't be updated in place.

Thus, all these methods return a **new** DataFrame. To overwrite the original DataFrame, you must reassign the result of the method call, like so:

> `df = df.withColumn("newCol", df.oldCol + 1)`

The above code creates a DataFrame with the same columns as `df` plus a new column, `newCol`, where every entry is equal to the corresponding entry from `oldCol`, plus one.

To overwrite an existing column, just pass the name of that particular column as the first argument!

In [22]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Pulling the flights from the catalog to create a Spark DataFrame containing the values of the flights.
flights = spark.table('flights')

# Show the head using .show() method. The column air_time contains the duration of the flight in minutes.
flights.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1

Verifying that the `spark.table()` method returns a Spark DataFrame:

In [23]:
type(flights)

pyspark.sql.dataframe.DataFrame

### The `.withColumn()` method
We can manipulate a column, or add a new column to our Spark DataFrame by calling the `.withColumn()` method on the DataFrame.

Adding another column `duration_hrs` to our DataFrame whose value is derived from the `air_time` column...

In [24]:
flights = flights.withColumn('duration_hrs', flights.air_time / 60)

flights.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|      duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|               2.2|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|               6.0|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|              1.85|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|1.3833333333333333|
|2014|    3|  9|     754|  

---

### Filtering Data
Now that you have a bit of SQL know-how under your belt, it's easier to talk about the analogous operations using Spark DataFrames.

Let's take a look at the `.filter()` method. As you might suspect, this is the Spark counterpart of SQL's WHERE clause. The `.filter()` method takes either an expression that would follow the WHERE clause of a SQL expression as a string, or a Spark Column of boolean (True/False) values.

For example, the following two expressions will produce the same output:

> `flights.filter("air_time > 120").show()`

> `flights.filter(flights.air_time > 120).show()`

Notice that in the first case, we pass a **string** to `.filter()`. In SQL, we would write this filtering task as `SELECT * FROM flights WHERE air_time > 120`. Spark's `.filter()` can accept any expression that could go in the WHERE clause of a SQL query (in this case, `"air_time > 120"`), as long as it is passed as a string. Notice that in this case, we do not reference the name of the table in the string, just as we wouldn't inside the WHERE clause of the SQL query.

In the second case, we actually pass a **column of boolean values** to `.filter()`. Remember that `flights.air_time > 120` returns a column of boolean values that has True in place of those records in `flights.air_time` that are over 120, and False otherwise.

Remember, a `SparkSession` called `spark` is already in your workspace, along with the Spark DataFrame `flights`.

In [25]:
# Filter flights by passing a string to filter() method to find all flights that flew over 1000 miles distance.
long_flights1 = flights.filter('distance > 1000')

# Filter flights by passing a column of boolean values
long_flights2 = flights.filter(flights.distance > 1000)

# Print the data to check they're equal
long_flights1.show()
long_flights2.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|      duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|               6.0|
|2014|    4| 19|    1236|       -4|    1508|       -7|     AS| N309AS|   490|   SEA| SAN|     135|    1050|  12|    36|              2.25|
|2014|   11| 19|    1812|       -3|    2352|       -4|     AS| N564AS|    26|   SEA| ORD|     198|    1721|  18|    12|               3.3|
|2014|    8|  3|    1120|        0|    1415|        2|     AS| N305AS|   656|   SEA| PHX|     154|    1107|  11|    20| 2.566666666666667|
|2014|   11| 12|    2346|  

---

### Selecting
The Spark variant of SQL's `SELECT` is the `.select()` method. This method takes multiple arguments - one for each column you want to select. These arguments can either be the column name as a string (one for each column) or a column object (using the `df.colName` syntax). When you pass a column object, you can perform operations like addition or subtraction on the column to change the data contained in it, much like inside `.withColumn()`.

#### Difference between `.select()` and `.withColumn()`
The difference between `.select()` and `.withColumn()` methods is that `.select()` returns only the columns you specify, while `.withColumn()` returns all the columns of the DataFrame in addition to the one you defined. It's often a good idea to drop columns you don't need at the beginning of an operation so that you're not dragging around extra data as you're wrangling. In this case, you would use `.select()` and not `.withColumn()`.


In [26]:
# BY STRINGS: Selecting the columns tailnum, origin, and dest from flights by passing the column names as strings.
selected1 = flights.select('tailnum', 'origin', 'dest')
selected1.show()

# BY COLUMN OBJECTS: Selecting the columns origin, dest, and carrier using the df.colName syntax
temp = flights.select(flights.origin, flights.dest, flights.carrier)
temp.show()

# Defining boolean filters and filtering data based on these filters
filterA = flights.origin == "SEA"
filterB = flights.dest == "PDX"
selected2  = flights.filter(filterA).filter(filterB)
selected2.show()

+-------+------+----+
|tailnum|origin|dest|
+-------+------+----+
| N846VA|   SEA| LAX|
| N559AS|   SEA| HNL|
| N847VA|   SEA| SFO|
| N360SW|   PDX| SJC|
| N612AS|   SEA| BUR|
| N646SW|   PDX| DEN|
| N422WN|   PDX| OAK|
| N361VA|   SEA| SFO|
| N309AS|   SEA| SAN|
| N564AS|   SEA| ORD|
| N323AS|   SEA| LAX|
| N305AS|   SEA| PHX|
| N433AS|   SEA| LAS|
| N765AS|   SEA| ANC|
| N713AS|   SEA| SFO|
| N27205|   PDX| SFO|
| N626AS|   SEA| SMF|
| N8634A|   SEA| MDW|
| N597AS|   SEA| BOS|
| N215AG|   PDX| BUR|
+-------+------+----+
only showing top 20 rows

+------+----+-------+
|origin|dest|carrier|
+------+----+-------+
|   SEA| LAX|     VX|
|   SEA| HNL|     AS|
|   SEA| SFO|     VX|
|   PDX| SJC|     WN|
|   SEA| BUR|     AS|
|   PDX| DEN|     WN|
|   PDX| OAK|     WN|
|   SEA| SFO|     VX|
|   SEA| SAN|     AS|
|   SEA| ORD|     AS|
|   SEA| LAX|     AS|
|   SEA| PHX|     AS|
|   SEA| LAS|     AS|
|   SEA| ANC|     AS|
|   SEA| SFO|     AS|
|   PDX| SFO|     UA|
|   SEA| SMF|     AS|
|   SE

Similar to SQL, we can also use the `.select()` method to perform column-wise operations. When we're selecting a column using the `df.colName` notation, we can perform any column operation and the `.select()` method will return the transformed column. For example,

> `flights.select(flights.air_time/60)`

returns a column of flight durations in hours instead of minutes. We can also use the `.alias()` method to rename a column we're selecting. So if we wanted to `.select()` the column `duration_hrs` (which isn't in our DataFrame) we could do

> `flights.select((flights.air_time/60).alias("duration_hrs"))`

The equivalent Spark DataFrame method `.selectExpr()` takes SQL expressions as a string:

> `flights.selectExpr("air_time/60 as duration_hrs")`

with the SQL `as` keyword being equivalent to the `.alias()` method.

To select multiple columns, we can pass multiple strings.

In [27]:
# Define avg_speed
avg_speed = (flights.distance/(flights.air_time/60)).alias("avg_speed") # The result is a Spark Column object, not a DataFrame
print(type(avg_speed))

print("-" * 5)

# Select the correct columns
speed1 = flights.select("origin", "dest", "tailnum", avg_speed)
print(type(speed1))
speed1.show()

print("-" * 5)

# Create the same table using a SQL expression
speed2 = flights.selectExpr("origin", "dest", "tailnum", "distance/(air_time/60) as avg_speed")
print(type(speed2))
speed2.show()

<class 'pyspark.sql.column.Column'>
-----
<class 'pyspark.sql.dataframe.DataFrame'>
+------+----+-------+------------------+
|origin|dest|tailnum|         avg_speed|
+------+----+-------+------------------+
|   SEA| LAX| N846VA| 433.6363636363636|
|   SEA| HNL| N559AS| 446.1666666666667|
|   SEA| SFO| N847VA|367.02702702702703|
|   PDX| SJC| N360SW| 411.3253012048193|
|   SEA| BUR| N612AS| 442.6771653543307|
|   PDX| DEN| N646SW|491.40495867768595|
|   PDX| OAK| N422WN|             362.0|
|   SEA| SFO| N361VA| 415.7142857142857|
|   SEA| SAN| N309AS| 466.6666666666667|
|   SEA| ORD| N564AS| 521.5151515151515|
|   SEA| LAX| N323AS| 440.3076923076923|
|   SEA| PHX| N305AS|431.29870129870125|
|   SEA| LAS| N433AS| 409.6062992125984|
|   SEA| ANC| N765AS|474.75409836065575|
|   SEA| SFO| N713AS| 315.8139534883721|
|   PDX| SFO| N27205| 366.6666666666667|
|   SEA| SMF| N626AS|477.63157894736844|
|   SEA| MDW| N8634A|481.38888888888886|
|   SEA| BOS| N597AS| 516.4137931034483|
|   PDX| BUR| 

---

## Aggregating
All of the common aggregation methods, like `.min()`, `.max()`, and `.count()`, are `GroupedData` methods. A `GroupedData` object is created by calling the `.groupBy()` DataFrame method. For example, to find the minimum value of a column, `col`, in a DataFrame, `df`, we can do

> `df.groupBy().min("col").show()`

This creates a `GroupedData` object (so we can use the `.min()` method), then finds the minimum value in `col`, and returns it as a DataFrame.

Now we're ready to do some aggregating of our own!


Let's find the length of the shortest flight (in terms of distance) that left `PDX` by first calling `.filter()` and then using the `.min()` method. We filter by referencing the column directly rather than passing a SQL string.

In [28]:
flights.filter(flights.origin=='PDX').groupBy().min('distance').show()

AnalysisException: ignored

Next, let's find the length of the longest flight (in terms of time) that left `SEA` by calling `.filter()` and then using the `.max()` method. Again, we filter by referencing the column directly rather than passing a SQL string.

In [0]:
flights.filter(flights.origin=='SEA').groupBy().max('duration_hrs').show()

# Part 3: Getting started with machine learning pipelines
PySpark has built-in, cutting-edge machine learning routines, along with utilities to create full machine learning pipelines. We'll learn about them in this part.

# Part 4: Model tuning and selection
In this last part, we'll apply what we've learned to create a model that predicts which flights will be delayed.