In addition to replacing null values like we did with drop and fill, there are more flexible options that we can use with more than just null values. Probably the most common use case is to replace all values in a certain column according to their current value. The only requirement is that the replacement value be the same type as the original value.

## Working with Complex Types

In [1]:
from pyspark.sql.functions import struct

## UDF

In [2]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

udfExampleDF = spark.range(5).toDF("num")

def power3(double_value):
    return double_value ** 3

power3(2.0)

8.0

In [5]:
from pyspark.sql.functions import udf
power3udf = udf(power3)

In [6]:
from pyspark.sql.functions import col
udfExampleDF.select(power3udf(col("num"))).show()

+-----------+
|power3(num)|
+-----------+
|          0|
|          1|
|          8|
|         27|
|         64|
+-----------+



In [8]:
udfExampleDF.selectExpr("power3udf(num)").show()

AnalysisException: "Undefined function: 'power3udf'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0"

In [9]:
from pyspark.sql.types import IntegerType, DoubleType

spark.udf.register("power3py", power3, DoubleType())

<function __main__.power3(double_value)>

In [10]:
udfExampleDF.selectExpr("power3py(num)").show()

+-------------+
|power3py(num)|
+-------------+
|         null|
|         null|
|         null|
|         null|
|         null|
+-------------+



# Chapter 5 Aggregations

Aggregating is the act of collecting something together and is a cornerstone of big data analytics. In an aggregation you will specify a key or grouping and an aggregation function that specifies how you should transform one or more columns. This function must produce one result for each group given multiple input values. Spark's aggregation capabilities are sophisticated and mature, with a variety of different use cases and possibilities. In general, we use aggregations to summarize numerical data, usually by means of some grouping. This might be a summation, a product, or simple counting. Spark also allows us to aggregate any kind of value into an array, list or map, as we will see in the complex types part of this chapter.

In addition to working with any type of value, Spark also allows us to create a variety of different grouping types.

## Approximate Count Distinct

However, we are often working with really large datasets where the exact distinct count is irrelevant. In fact, getting the exact distinct count is a very expensive operation, and for large datasets it might take a very long time to calculate. There are times when an approximation to a certain degree of accuracy will work just fine.

In [1]:
from pyspark.sql.functions import approx_count_distinct

## First and Last

We can get the first and last values from a DataFrame with the aptly named first and last functions. The result is based on the order of the rows in the DataFrame, not on the values in the DataFrame.

## Variance and Standard Deviation

Calculating the mean naturally brings up questions about the variance and standard deviation. These are both measures of the spread of the data around the mean. The variance is the average of the squared differences from the mean, and the standard deviation is the square root of the variance. These can be calculated in Spark with their respective functions. Note, however, that Spark has both the formula for the sample standard deviation and the formula for the population standard deviation. These are fundamentally different statistical formulae, and it is important to differentiate between them. By default, Spark uses the formula for the sample standard deviation or variance if you use the variance or stddev functions.

You can also specify these explicitly or refer to the population standard deviation or variance.
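The difference between the two formulas can be checked outside Spark with Python's `statistics` module. This is only an illustration of the arithmetic on made-up values. In PySpark the corresponding explicit functions are `var_samp`, `var_pop`, `stddev_samp` and `stddev_pop`.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up sample, mean is 3.0

# Sample formulas divide the sum of squared differences by n - 1.
sample_variance = statistics.variance(values)
sample_stddev = statistics.stdev(values)

# Population formulas divide the same sum by n.
population_variance = statistics.pvariance(values)
population_stddev = statistics.pstdev(values)

print(sample_variance, population_variance)
print(sample_stddev, population_stddev)
```

The sum of squared differences here is 10, so the sample variance is 10 / 4 and the population variance is 10 / 5.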

## Covariance and Correlation

We discussed single-column aggregations, but some functions compare the interactions of the values in two different columns. Two of these functions are the covariance and the correlation. Correlation measures the Pearson correlation coefficient, which is scaled between -1 and +1. The covariance is scaled according to the inputs in the data.

Covariance, like variance above, can be calculated either as the sample covariance or the population covariance. Therefore it can be important to specify which formula you want to use. Correlation has no notion of this and therefore does not have separate calculations for population or sample.
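The formulas can be worked through by hand in plain Python. This is a sketch on made-up numbers, not Spark code. In PySpark the matching functions are `covar_samp`, `covar_pop` and `corr`.

```python
from math import sqrt

# Made-up paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sum of the products of the deviations from each mean.
co_moment = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

covar_pop = co_moment / n         # population covariance
covar_samp = co_moment / (n - 1)  # sample covariance

# Pearson correlation: the n (or n - 1) factors cancel, so there is only one formula.
sum_sq_x = sum((a - mean_x) ** 2 for a in x)
sum_sq_y = sum((b - mean_y) ** 2 for b in y)
corr = co_moment / sqrt(sum_sq_x * sum_sq_y)

print(covar_pop, covar_samp, corr)
```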

## Aggregating to Complex Types

Spark allows users to perform aggregations not just of numerical values using formulas but also into Spark's complex types. For example, we can collect a list of the values present in a given column, or only the unique values by collecting to a set. This can be used for more programmatic access later on in the pipeline or to pass the entire collection into a UDF.

## Grouping

Thus far we only performed DataFrame level aggregations. A more common task is to perform calculations based on groups in the data. This is most commonly performed on categorical data where we group our data on one column and perform some calculations on the other columns that end up in the group.

The best explanation for this is probably to start performing some grouping. The first we will perform will be a count, just as we did before. We will group by each unique invoice number and get the count of items on that invoice. Notice that this returns another DataFrame and is lazily performed.

When we perform this grouping we do it in two phases. First we specify the column(s) that we would like to group on, then we specify our aggregation(s). The first step returns a RelationalGroupedDataset (a GroupedData object in PySpark) and the second step returns a DataFrame.

## Grouping with expressions

Counting, as we saw previously, is a bit of a special case because it exists as a method. Usually we prefer to use the count function (the same function that we saw earlier in this chapter). However, rather than passing that function as an expression into a select statement, we specify it inside of agg. This allows for passing in arbitrary expressions that just need to have some aggregation specified. We can even do things like alias a column after transforming it for later use in our data flow.

## Grouping with Maps

Sometimes it can be easier to specify your transformations as a series of Maps where the key is the column and the value is the aggregation function that you would like to perform. You can reuse multiple column names if you specify them inline as well.

## Window Functions

# Rollups

Thus far we've been looking at explicit groupings. When we set our grouping keys of multiple columns, Spark will look at those and look at the actual combinations that are visible in the dataset. A Rollup is a multidimensional aggregation that performs a variety of group by style calculations for us.

Now that we have prepared our data, we can perform our rollup. This rollup will look across time and space and will create a new DataFrame that includes the grand total over all dates, the grand total for each date in the DataFrame, and the subtotal for each country on each date in the DataFrame.

## User-Defined Aggregation Functions

User-Defined Aggregation Functions or UDAFs are a way for users to define their own aggregation functions based on custom formulae or business rules. These UDAFs can be used to compute custom calculations over groups of input data. Spark maintains a single AggregationBuffer to store intermediate results for every group of input data.

To create a UDAF you must inherit from the base class UserDefinedAggregateFunction and implement the following methods.

In [6]:
person = spark.createDataFrame([
    (0, "Bill Chambers", 0, [100]),
    (1, "Matei Zaharia", 1, [500, 250, 100]),
    (2, "Michael Armbrust", 1, [250, 100])])\
    .toDF("id", "name", "graduate_program", "spark_status")

In [7]:
graduateProgram = spark.createDataFrame([
    (0, "Masters", "School of Information", "UC Berkeley"),
    (2, "Masters", "EECS", "UC Berkeley"),
    (1, "Ph.D.", "EECS", "UC Berkeley")])\
    .toDF("id", "degree", "department", "school")

In [8]:
sparkStatus = spark.createDataFrame([
    (500, "Vice President"),
    (250, "PMC Member"),
    (100, "Contributor")])\
    .toDF("id", "status")

In [10]:
person.createOrReplaceTempView("person")
graduateProgram.createOrReplaceTempView("graduateProgram")
sparkStatus.createOrReplaceTempView("sparkStatus")

In [14]:
person["graduate_program"] == graduateProgram['id']

Column<b'(graduate_program = id)'>

In [15]:
joinExpression = person["graduate_program"] == graduateProgram["id"]

In [16]:
person.join(graduateProgram, joinExpression).show()

+---+----------------+----------------+---------------+---+-------+--------------------+-----------+
| id|            name|graduate_program|   spark_status| id| degree|          department|     school|
+---+----------------+----------------+---------------+---+-------+--------------------+-----------+
|  0|   Bill Chambers|               0|          [100]|  0|Masters|School of Informa...|UC Berkeley|
|  1|   Matei Zaharia|               1|[500, 250, 100]|  1|  Ph.D.|                EECS|UC Berkeley|
|  2|Michael Armbrust|               1|     [250, 100]|  1|  Ph.D.|                EECS|UC Berkeley|
+---+----------------+----------------+---------------+---+-------+--------------------+-----------+



In [17]:
joinType = "inner"
person.join(graduateProgram, joinExpression, joinType).show()

+---+----------------+----------------+---------------+---+-------+--------------------+-----------+
| id|            name|graduate_program|   spark_status| id| degree|          department|     school|
+---+----------------+----------------+---------------+---+-------+--------------------+-----------+
|  0|   Bill Chambers|               0|          [100]|  0|Masters|School of Informa...|UC Berkeley|
|  1|   Matei Zaharia|               1|[500, 250, 100]|  1|  Ph.D.|                EECS|UC Berkeley|
|  2|Michael Armbrust|               1|     [250, 100]|  1|  Ph.D.|                EECS|UC Berkeley|
+---+----------------+----------------+---------------+---+-------+--------------------+-----------+



In [18]:
from pyspark.sql.functions import expr
person\
    .withColumnRenamed("id", "personId")\
    .join(sparkStatus, expr("array_contains(spark_status, id)"))\
    .take(5)

[Row(personId=0, name='Bill Chambers', graduate_program=0, spark_status=[100], id=100, status='Contributor'),
 Row(personId=1, name='Matei Zaharia', graduate_program=1, spark_status=[500, 250, 100], id=500, status='Vice President'),
 Row(personId=1, name='Matei Zaharia', graduate_program=1, spark_status=[500, 250, 100], id=250, status='PMC Member'),
 Row(personId=1, name='Matei Zaharia', graduate_program=1, spark_status=[500, 250, 100], id=100, status='Contributor'),
 Row(personId=2, name='Michael Armbrust', graduate_program=1, spark_status=[250, 100], id=250, status='PMC Member')]

## Handling Duplicate Column Names

Arguably one of the most annoying things that comes up is duplicate column names in your result DataFrame. In a DataFrame, each column has a unique ID inside of Spark's SQL engine, Catalyst. This unique ID is purely internal and not something that users can directly reference. That means when you have a DataFrame with duplicate column names, referring to one column can be quite difficult.

This arises in two distinct situations:

## How Spark performs Joins

Understanding how Spark performs joins means understanding the two core resources at play: the node-to-node communication strategy and the per-node computation strategy. These internals are likely irrelevant to your business problem. However, understanding how Spark performs joins can mean the difference between a job that completes quickly and one that never completes at all.

## Node-to-Node Communication Strategies

There are two different approaches Spark can take when it comes to communication. Spark will either incur a shuffle join, which results in all-to-all communication, or a broadcast join, where one of the DataFrames you work with is duplicated around the cluster. In general, a broadcast join results in lower total communication than a shuffle join. Let's talk through these in slightly less abstract terms.

In a shuffle join, every node will talk to every other node and they will share data according to which node has a certain key or set of keys. These joins are expensive because the network can get congested with traffic, especially if your data is not partitioned well.

This join describes taking a large table of data and joining it to another large table of data. An example of this might be a company that receives trillions of internet-of-things messages every day. You need to compare day-over-day change by joining on deviceId, messageType and date in one column and date - 1 day in the other in order to see changes in day-over-day traffic and message types.

## Parquet Files

Apache Parquet is an open source column-oriented data store that provides a variety of storage optimizations, especially for analytics workloads. It provides columnar compression in order to save storage space and allows for reading individual columns instead of entire files. It is a file format that works exceptionally well with Apache Spark and is Spark's default file format. We recommend writing data out to Parquet for long-term storage, as reading from a Parquet file will always be more efficient than reading from JSON or CSV. Another advantage of Parquet is that it supports complex types. That means that if your column is an array, map, or struct, you'll still be able to read and write that file without issue.

## Reading Parquet Files

Parquet has exceptionally few options because it enforces its own schema when storing data. As a result, all we have to set is the format and we are good to go. We can set the schema if we have strict requirements for what our DataFrame should look like. Often this is not necessary, because we can leverage schema on read, which is similar to the inferSchema option for CSV files but more powerful, because the schema is built into the file itself.

### Parquet Options

There are few parquet options because it has a well defined specification that aligns well with the concepts in Spark.

### Reading from Databases in Parallel

Throughout this book we have talked about partitioning and its importance in data processing. When we read in a set of Parquet files, for example, we will get one Spark partition per file. When we read from SQL databases, we will by default always get one partition. This can be helpful if the dataset is small and we'd like to broadcast it out to all other workers, but if it's a larger dataset it is sometimes better to read it into multiple partitions and even control what the keys of those partitions are.

## Advanced IO Concepts

### Reading Data in Parallel

## Big Data and SQL : Hive

Before Spark's rise, Hive was the de facto big data SQL access layer. Originally developed at Facebook, Hive became an incredibly popular tool across industry for performing SQL operations on big data. In many ways it helped propel Hadoop into different industries because analysts could run SQL queries.

## SparkSQL Thrift JDBC/ODBC Server

Spark provides a JDBC interface by which either you or a remote program connects to the Spark driver in order to execute Spark SQL queries. A common use case might be for a business analyst to connect business intelligence software like Tableau to Spark. The Thrift JDBC/ODBC server implemented here corresponds to the HiveServer2 in Hive 1.2.1. You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1.

# Spark SQL

Spark SQL is arguably one of the most important and powerful concepts in Spark. This chapter will introduce the core concepts in Spark SQL that you need to understand. This chapter will not rewrite the ANSI-SQL specification or enumerate every single kind of SQL expression. If you read any other parts of this book, you will notice that we try to include SQL code wherever we include DataFrame code to make it easy to cross reference with code examples. Other examples are available in the appendix and reference sections.

## Spark SQL CLI

Configuration of Hive is done by placing your hive-site.xml, core-site.xml and hdfs-site.xml files in conf/. 

## Spark's Programmatic SQL Interface

In addition to setting up a server, you can also execute SQL in an ad hoc manner via any of Spark's languages. This is done via the sql method on the SparkSession object. This will return a DataFrame, as we will see later in this chapter.

## Tables

To do anything useful with Spark SQL, we first need to define tables. Tables are logically equivalent to a DataFrame in that they are a structure of data that we execute commands against. We can join tables, filter them, aggregate them and perform the many different manipulations that we saw in previous chapters. The core difference between tables and DataFrames is that while we define DataFrames in the scope of a programming language, we define tables inside of a database. This means that when you create a table without specifying a database, it will belong to the default database. We will discuss databases more later on in the chapter.

## Views

Now that we have created a table, another thing we can define is a view. A view specifies a set of transformations on top of an existing table. Views can be either just a saved query plan to be executed against the source table or they can be materialized, which means that the results are precomputed.

## Creating Views

To an end user, views are displayed as tables, except that rather than rewriting all of the data to a new location, they simply perform a transformation on the source data at query time. This might be a filter, a select, or potentially an even larger group by or rollup. For example, we can create a view where the destination must be United States in order to see only flights to the USA.

## Spark Managed Tables

One important note is the concept of managed versus unmanaged tables. Tables store two important pieces of information: the data within the tables and the data about the tables, that is, the metadata. You can have Spark manage the metadata for a set of files, as well as the data. When you define a table from files on disk, you are defining an unmanaged table. When you use saveAsTable on a DataFrame, you are creating a managed table where Spark will keep track of all of the relevant information for you.

This will read in our table and write it out to a new location in Spark format. We can see this reflected in the new explain plan. In the explain plan you will also notice that this writes to the default Hive warehouse location. You can set this by setting the spark.sql.warehouse.dir configuration to the directory of your choosing at SparkSession creation time. By default Spark sets this to /user/hive/warehouse.

## Creating External Tables

Now, as we mentioned in the beginning of this chapter, Hive was one of the first big data SQL systems, and Spark SQL is completely compatible with Hive SQL statements. One of the use cases you may have here will be to port your legacy Hive statements to Spark SQL. Luckily you can just copy and paste your Hive statements directly into Spark SQL. For example, below I am creating an unmanaged table. Spark will manage the metadata about this table; however, the files are not managed by Spark at all. We create this table with the CREATE EXTERNAL TABLE statement.

## Dropping Unmanaged Tables

If we are dropping an unmanaged table no data will be removed but we won't be able to refer to this data by the table name any longer.

## Subqueries

Subqueries allow you to specify queries within other queries. This can allow you to specify some sophisticated logic inside of your SQL. In Spark there are two fundamental kinds of subqueries. Correlated subqueries use some information from the outer scope of the query in order to supplement information in the subquery. Uncorrelated subqueries include no information from the outer scope. Each of these queries can return one or more values. Spark also includes support for predicate subqueries, which allow for filtering based on values.

## Correlated Predicate Subqueries

Correlated predicate subqueries allow us to use information from the outer scope in our inner query. For example, if we want to see whether or not we have a flight that will take you back from your destination country, we could do so by checking whether or not there was a flight that had the destination country as an origin and a flight that had the origin country as a destination.

## Chapter 9 Datasets

Datasets are the foundational type of the Structured APIs. Earlier in this section we worked with DataFrames, which are Datasets of type Row and are available across Spark's different languages. Datasets are a strictly JVM language feature that only works with Scala and Java. Datasets allow you to define the object that each row in your Dataset will consist of. In Scala this will be a case class object that essentially defines a schema that you can leverage, and in Java you will define a Java Bean. Experienced users often refer to Datasets as the "typed set of APIs" in Spark. See the Structured API Overview Chapter for more information.

When you use the DataFrame API, you do not create Strings or Integers; Spark manipulates the data for you by manipulating the Row. When you use the Dataset API, for every row it touches with user code, Spark converts the Spark Row format to the case class object you specify when you create your Dataset. This will slow down your operations but can provide more flexibility. You will notice a performance difference, but it is of a far smaller order of magnitude than what you might see from something like a Python UDF, because the cost is not as extreme as switching programming languages. It is still an important thing to keep in mind.