## Spark DataFrames

### Imports

For these examples, we need to import just two **pyspark.sql** modules:
- `types`
- `functions`

We need `pyspark.sql.types` to define schemas for the DataFrames. The `pyspark.sql.functions` module contains the built-in functions that operate on DataFrame columns in **PySpark**.

In [3]:
from pyspark.sql.types import *  # Necessary for creating schemas
from pyspark.sql.functions import * # Importing PySpark functions

### Creating DataFrames

#### Making a Simple DataFrame from a Tuple List

In [6]:
# Make a tuple list
a_list = [('a', 1), ('b', 2), ('c', 3)]

# Create a Spark DataFrame, without supplying a schema value
df_from_list_no_schema = \
sqlContext.createDataFrame(a_list)

# Print the DF object
print (df_from_list_no_schema)

# Print a collected list of Row objects
print (df_from_list_no_schema.collect())

# Show the DataFrame
df_from_list_no_schema.show()

#### Making a Simple DataFrame from a Tuple List and a Schema

In [8]:
# Create a Spark DataFrame, this time with schema
df_from_list_with_schema = \
sqlContext.createDataFrame(a_list, ['letters', 'numbers']) # this simple schema contains just column names

# Show the DataFrame
df_from_list_with_schema.show()

# Show the DataFrame's schema
df_from_list_with_schema.printSchema()

#### Making a Simple DataFrame from a List of Dictionaries

In [10]:
# Make a list of dictionaries (one dictionary per row)
a_dict = [{'letters': 'a', 'numbers': 1},
          {'letters': 'b', 'numbers': 2},
          {'letters': 'c', 'numbers': 3}]

# Create a Spark DataFrame from the list of dictionaries
df_from_dict = \
(sqlContext
 .createDataFrame(a_dict)) # You will get a warning about this

# Show the DataFrame
df_from_dict.show()

# Warning text: inferring schema from dict is deprecated, please use pyspark.sql.Row instead


#### Making a Simple DataFrame Using a StructType Schema + RDD

In [12]:
# Define the schema
schema = StructType([
    StructField('letters', StringType(), True),
    StructField('numbers', IntegerType(), True)])

# Create an RDD from a list
rdd = sc.parallelize(a_list)

# Create the DataFrame from these raw components
nice_df = \
(sqlContext
 .createDataFrame(rdd, schema))

# Show the DataFrame
nice_df.show()

### Simple Inspection Functions

Now that we have `nice_df`, here are some useful functions for inspecting the DataFrame.

In [15]:
# `columns`: return all column names as a list
nice_df.columns

In [16]:
# `dtypes`: get the datatypes for all columns
nice_df.dtypes

In [17]:
# `printSchema()`: prints the schema of the supplied DF
nice_df.printSchema()

In [18]:
# `schema`: returns the schema of the DF as a `StructType`
nice_df.schema

In [19]:
# `first()` returns the first row as a Row, while
# `head(n)` and `take(n)` return a list of `n` Row objects
print(nice_df.first()) # takes no argument; never a list
print(nice_df.head(2)) # `n` is optional: without it, a single Row;
                       # with any `n` supplied (even 1), a list
print(nice_df.take(2)) # requires `n`; always a list

In [20]:
# `count()`: returns a count of all rows in DF
nice_df.count()

In [21]:
# `describe()`: compute summary stats (count, mean, stddev, min, max) for numerical columns
nice_df.describe().show() # can optionally supply a list of column names

In [22]:
# `explain()`: print the physical plan Spark will use to evaluate the DF
nice_df.explain()

### Relatively Simple DataFrame Manipulation Functions

Let's use these functions:
- `unionAll()`: append the rows of one DataFrame to another (available as `union()` from Spark 2.0 onward)
- `orderBy()`: sort the DataFrame rows by one or more columns
- `select()`: select which DataFrame columns to retain
- `drop()`: select a single DataFrame column to remove
- `filter()`: retain DataFrame rows that match a condition

In [24]:
# Take the DataFrame and add it to itself
(nice_df
 .unionAll(nice_df)
 .show())

# Add it to itself twice
(nice_df
 .unionAll(nice_df)
 .unionAll(nice_df)
 .show())

# Columns are matched by position, not by name, so
# type coercion will occur if the schemas don't align
(nice_df
 .select(['numbers', 'letters'])
 .unionAll(nice_df)
 .show())

(nice_df
 .select(['numbers', 'letters'])
 .unionAll(nice_df)
 .printSchema())

In [25]:
# Sorting the DataFrame by the `numbers` column
(nice_df
 .unionAll(nice_df)
 .unionAll(nice_df)
 .orderBy('numbers')
 .show())

# Sort the same column in reverse order
(nice_df
 .unionAll(nice_df)
 .unionAll(nice_df)
 .orderBy('numbers',
          ascending = False)
 .show())

In [26]:
# `select()` takes one or more column names (or a list of them),
# while `drop()` takes the name of the column to remove

# Select only the first column of the DF
(nice_df
 .select('letters')
 .show())

# Re-order columns in the DF using `select()`
(nice_df
 .select(['numbers', 'letters'])
 .show())

# Drop the second column of the DF
(nice_df
 .drop('letters')
 .show())

In [27]:
# The `filter()` function performs filtering of DF rows

# Here is some numeric filtering with comparison operators
# (>, <, >=, <=, ==, != all work)

# Keep rows where the value in `numbers` is > 1
(nice_df
 .filter(nice_df.numbers > 1)
 .show())

# Perform two filter operations
(nice_df
 .filter(nice_df.numbers > 1)
 .filter(nice_df.numbers < 3)
 .show())

# Not just numbers! Use the `filter()` + `isin()`
# combo to filter on string columns with a set of values
(nice_df
 .filter(nice_df.letters
         .isin(['a', 'b']))
 .show())

### The 'groupBy' Function and Aggregations

The `groupBy()` function groups the DataFrame by the specified columns so that we can run aggregations on each group. The available functions on the grouped data are:

- `count()`: counts the number of records for each group
- `sum()`: computes the sum of each numeric column for each group
- `min()`: computes the minimum value of each numeric column for each group
- `max()`: computes the maximum value of each numeric column for each group
- `avg()` or `mean()`: computes the average value of each numeric column for each group
- `pivot()`: pivots a column of the current DataFrame and performs the specified aggregation

Before we get into aggregations, let's load in a **CSV** with interesting data and create a new DataFrame.

You do this with the `spark-csv` package. Documentation on that is available at:
- https://github.com/databricks/spark-csv

The dataset we will load for these demonstrations contains data about flights departing New York City airports (`JFK`, `LGA`, `EWR`) in 2013. It has 336,776 rows and 16 columns.

### Useful Links

Although I tried to cover a lot of ground, there are dozens more DataFrame functions that I haven't touched upon.

The main project page for Spark:

- http://spark.apache.org

The main reference for PySpark is:

- http://spark.apache.org/docs/latest/api/python/index.html

These examples are available at:

- https://github.com/rich-iannone/so-many-pyspark-examples

Information on the Parquet file format can be found at its project page:

- http://parquet.apache.org

The GitHub project page for the `spark-csv` package, which contains usage documentation:

- https://github.com/databricks/spark-csv