# Cleaning Data with PySpark - Part 1

## DataFrame details
A review of DataFrame fundamentals and the importance of data cleaning.

In [7]:
BUCKET = 'driven-actor-210609'

In [10]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

In [4]:
spark = SparkSession.builder.getOrCreate()


### Defining a schema
Creating a defined schema helps with data quality and import performance. As mentioned during the lesson, we'll create a simple schema to read in the following columns:
- Name
- Age
- City

The Name and City columns are StringType() and the Age column is an IntegerType().

In [None]:
# Import the schema classes from pyspark.sql.types
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define a new schema using the StructType method
people_schema = StructType([
  # Define a StructField for each field
  StructField('name', StringType(), False),
  StructField('age', IntegerType(), False),
  StructField('city', StringType(), False)
])

# Load the data with the explicit schema (load() takes the file location as `path`)
people_df = spark.read.format('csv').load(path='rawdata.csv', schema=people_schema)

### Using lazy processing
Lazy processing operations will usually return in about the same amount of time regardless of the actual quantity of data. Remember that this is due to Spark not performing any transformations until an action is requested.

For this exercise, we'll define a DataFrame (`aa_dfw_df`) and add a couple of transformations. Note how long the transformations take to define compared with how long it takes to actually query the data. The difference may be small here, but it should be noticeable. On a full Spark cluster with larger quantities of data, the difference will be more apparent.

In [8]:
file_path = f'gs://{BUCKET}/pyspark/datasets/AA_DFW_2018_Departures.csv.gz'
# file_path = 'datasets/AA_DFW_2018_Departures.csv.gz'

In [11]:
# Load the CSV file
aa_dfw_df = spark.read.format('csv').options(Header=True).load(file_path)

# Add the airport column using the F.lower() method
aa_dfw_df = aa_dfw_df.withColumn('airport', F.lower(aa_dfw_df['Destination Airport']))

# Show the DataFrame
aa_dfw_df.show()

+-----------------+-------------+-------------------+-----------------------------+-------+
|Date (MM/DD/YYYY)|Flight Number|Destination Airport|Actual elapsed time (Minutes)|airport|
+-----------------+-------------+-------------------+-----------------------------+-------+
|       01/01/2018|         0005|                HNL|                          498|    hnl|
|       01/01/2018|         0007|                OGG|                          501|    ogg|
|       01/01/2018|         0043|                DTW|                            0|    dtw|
|       01/01/2018|         0051|                STL|                          100|    stl|
|       01/01/2018|         0075|                DCA|                          147|    dca|
|       01/01/2018|         0096|                STL|                           92|    stl|
|       01/01/2018|         0103|                SJC|                          227|    sjc|
|       01/01/2018|         0119|                OGG|                          5

### Saving a DataFrame in Parquet format
When working with Spark, you'll often start with CSV, JSON, or other data sources. These give you a lot of flexibility in the types of data you can load, but they are not optimal formats for Spark. `Parquet` is a columnar storage format that lets Spark use predicate pushdown. This means Spark reads only the data needed to complete the operations you define rather than the entire dataset, which often drastically improves performance on large datasets.

In this exercise, we're going to practice creating a new Parquet file and then process some data from it.

The `spark` object and the `df1` and `df2` DataFrames have been set up for you.

In [20]:
file_path_2017 = f'gs://{BUCKET}/pyspark/datasets/AA_DFW_2017_Departures.csv.gz'
df1 = spark.read.format('csv').options(Header=True).load(file_path_2017)

In [21]:
file_path_2018 = f'gs://{BUCKET}/pyspark/datasets/AA_DFW_2018_Departures.csv.gz'
df2 = spark.read.format('csv').options(Header=True).load(file_path_2018)

In [23]:
# View the row count of df1 and df2
print("df1 Count: %d" % df1.count())
print("df2 Count: %d" % df2.count())

# Combine the DataFrames into one
df3 = df1.union(df2)

file_path_all = f'gs://{BUCKET}/pyspark/datasets/AA_DFW_ALL.parquet'
# Save the df3 DataFrame in Parquet format
df3.coalesce(1).write.parquet(file_path_all, mode='overwrite')

# Read the Parquet file into a new DataFrame and run a count
print(spark.read.parquet(file_path_all).count())

df1 Count: 139358
df2 Count: 119910
259268


### SQL and Parquet
Parquet files are perfect as a backing data store for SQL queries in Spark. While it is possible to run the same queries directly via Spark's Python functions, sometimes it's easier to run SQL queries alongside the Python options.

For this example, we're going to read in the Parquet file we created in the last exercise and register it as a SQL table. Once registered, we'll run a quick query against the table (aka, the Parquet file).

The `spark` object and the `AA_DFW_ALL.parquet` file are available for you automatically.

In [24]:
file_path_all = f'gs://{BUCKET}/pyspark/datasets/AA_DFW_ALL.parquet'

In [31]:
# Read the Parquet file into flights_df (Parquet stores its own schema, so no header option is needed)
flights_df = spark.read.parquet(file_path_all)

flights_df = flights_df.withColumnRenamed('Actual elapsed time (Minutes)', 'flight_duration')

# Register the temp table
flights_df.createOrReplaceTempView('flights')

# Run a SQL query of the average flight duration and pull the value out of the first Row
avg_duration = spark.sql('SELECT avg(flight_duration) from flights').collect()[0][0]
print('The average flight time is: %d' % avg_duration)

The average flight time is: 151
