# PySpark Lab2

In this lab we will explore and test some more Spark functionality.

As in the previous lab, we start the notebook by installing pyspark.

In [None]:
!pip install pyspark

## Get the dataset

To make the dataset quick to fetch, we have uploaded an archive to a Google Drive account. The cell below downloads the archive, unpacks it, and moves the files into the current directory.

The archive contains 2006.csv, 2007.csv, and 2008.csv, which hold information about flights during those years.

Now, execute the following code cell.

In [None]:
!gdown --id "1QJ-wDWTc3oM_jbSlb5cB3HPJ7Jwy5iAH"
!unzip SparkTutorials2i3.zip
!mv Spark_Tutorial2/2006.csv Spark_Tutorial2/2007.csv Spark_Tutorial2/2008.csv .
!rm -r __MACOSX/ Spark_Tutorial3/ Spark_Tutorial2 SparkTutorials2i3.zip
!ls

## Map Reduce in Spark

For this lab we will be using three files. Let's load them!

In [None]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName('testSparkSession').getOrCreate()

df2006 = spark.read.format("csv").option("header", "true").option("nullValue","NA").option("inferSchema", "true").load("2006.csv")
df2007 = spark.read.format("csv").option("header", "true").option("nullValue","NA").option("inferSchema", "true").load("2007.csv")
df2008 = spark.read.format("csv").option("header", "true").option("nullValue","NA").option("inferSchema", "true").load("2008.csv")

print("df2006 number of partitions", df2006.rdd.getNumPartitions())
print("df2007 number of partitions", df2007.rdd.getNumPartitions())
print("df2008 number of partitions", df2008.rdd.getNumPartitions())

We have now loaded the three files and checked how many partitions each one has.

Let's also check the number of rows in each DataFrame.

In [None]:
print("df2006 number of elements", df2006.count())
print("df2007 number of elements", df2007.count())
print("df2008 number of elements", df2008.count())

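Let's now combine all three DataFrames into one. Note that union matches columns by position, not by name, so the DataFrames must have their columns in the same order.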

In [None]:
df1 = df2006.union(df2007).union(df2008)

How many elements?

In [None]:
df1.count()

How many partitions?

In [None]:
df1.rdd.getNumPartitions()

Let's now do some filtering.

First we pick some columns and remove the rows that contain null values.

In [None]:
df2 = df1.select("Year", "Month", "Origin", "Dest", "ArrDelay", "DepDelay")
df3 = df2.na.drop()

Now, as in the previous lab, we compute the sum of the arrival and departure delays and store it in a new column.

In [None]:
from pyspark.sql.functions import expr
df4 = df3.withColumn("SumDelay", expr("ArrDelay + DepDelay"))

Again, we cache the DataFrame so that later computations run faster. Note that cache() is lazy: the data is only stored in memory the first time an action is executed on df4.

In [None]:
df4.cache()

Let's use grouping operations to get, for instance, the average SumDelay for each Origin.

In [None]:
from pyspark.sql.functions import avg
df5 = df4.groupBy("Origin").agg(avg("SumDelay"))
df5.show()
df5.count()

We can also rename a column.

In [None]:
df6 = df5.withColumnRenamed("avg(SumDelay)", "Average Delay")
df6.show()