# Bilingual PySpark: Blending Python & SQL code

In this section, we'll see how to use Python and SQL together in PySpark.

In [1]:
# Let's import from libs
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.getOrCreate()


### Comparing pyspark.sql vs. plain SQL

In [2]:
elements = spark.read.csv(
    "data/elements/Periodic_Table_Of_Elements.csv",
    header=True,
    inferSchema=True,
)

elements.where(F.col("phase") == "liq").groupby("period").count().show()

+------+-----+
|period|count|
+------+-----+
|     6|    1|
|     4|    1|
+------+-----+



```sql
SELECT
    period,
    count(*)
FROM 
    elements
WHERE 
    phase = 'liq'
GROUP BY 
    period;
```

<img src="images/python_sql_comparison.png" height="300px">

Let's see how to get a Spark data frame using SQL.

In [3]:
try:
    spark.sql(
        "select period, count(*) from elements "
        "where phase='liq' group by period"
    ).show(5)
except AnalysisException as e:
    print(e)


Table or view not found: elements; line 1 pos 29;
'Aggregate ['period], ['period, unresolvedalias(count(1), None)]
+- 'Filter ('phase = liq)
   +- 'UnresolvedRelation [elements], [], false



Here, PySpark doesn’t make the link between the Python variable `elements`, which points to the data frame, and a potential table `elements` that can be queried by Spark SQL. To allow a data frame to be queried via SQL, we need to **register** it.

When you want to create a table/view to query with Spark SQL, use the `createOrReplaceTempView()` method. This method takes a single string parameter, which is the name of the view you want to use. It looks at the data frame referenced by the Python variable on which the method is called and creates a Spark SQL reference to the same data frame.

In [4]:
elements.createOrReplaceTempView("elements")

spark.sql(
    "select period, count(*) from elements where phase='liq' group by period"
).show(5)

+------+--------+
|period|count(1)|
+------+--------+
|     6|       1|
|     4|       1|
+------+--------+



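How do we manage these registered views and tables? Spark handles this through the **catalog**. PySpark has four methods to create temporary views, and they look quite similar at first glance:
- `createGlobalTempView()`: creates a view in the `global_temp` database, shared by all sessions of the same Spark application; fails if the name is already taken.
- `createOrReplaceGlobalTempView()`: same scope, but replaces an existing view with that name.
- `createOrReplaceTempView()`: creates a view scoped to the current `SparkSession`, replacing any existing view with that name.
- `createTempView()`: creates a session-scoped view; fails if the name is already taken.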

The Spark catalog is an object that allows working with Spark SQL tables and views. A lot of its methods have to do with managing the metadata of those tables, such as their names and the level of caching.

In [5]:
spark.catalog

<pyspark.sql.catalog.Catalog at 0x7fc3b8279f70>

In [6]:
spark.catalog.listTables()

[Table(name='elements', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [7]:
spark.catalog.dropTempView("elements")
spark.catalog.listTables()

[]

## SQL and PySpark

The data was downloaded from [here](https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q3_2019.zip). To learn more about the dataset, see this [link](http://mng.bz/4jZa).


In [8]:
DIRECTORY = "data/backblaze/"

backblaze_2019 = spark.read.csv(
    DIRECTORY+"*.csv", 
    header=True,
    inferSchema=True
)



In [9]:
# Cast every SMART column to a long so that all of them share one type
backblaze_2019 = backblaze_2019.select(
    [
        F.col(x).cast(T.LongType()) if x.startswith("smart") else F.col(x)
        for x in backblaze_2019.columns
    ]
)

backblaze_2019.createOrReplaceTempView("backblaze_stats_2019")

Objective: perform a quick exploratory data analysis on a subset of the columns. We will then reproduce the failure rates that Backblaze computes and identify the models with the greatest and least number of failures in 2019.

#### Get the rows and columns you want: select and where

In [10]:
spark.sql(
    "select serial_number from backblaze_stats_2019 where failure = 1"
).show(5)


+-------------+
|serial_number|
+-------------+
|     ZA10MCJ5|
|     ZCH07T9K|
|     ZCH0CA7Z|
|     Z302F381|
|     ZCH0B3Z2|
+-------------+
only showing top 5 rows



In [11]:
backblaze_2019.where("failure = 1").select(F.col("serial_number")).show(5)

+-------------+
|serial_number|
+-------------+
|     ZA10MCJ5|
|     ZCH07T9K|
|     ZCH0CA7Z|
|     Z302F381|
|     ZCH0B3Z2|
+-------------+
only showing top 5 rows



#### Grouping similar records together: group by and order by

Let's start looking at the capacity, in gigabytes, of the hard drives included in the data, by model.

In [12]:
spark.sql(
    """SELECT
        model,
        min(capacity_bytes / pow(1024, 3)) min_GB,
        max(capacity_bytes/ pow(1024, 3)) max_GB
    FROM backblaze_stats_2019
    GROUP BY 1
    ORDER BY 3 DESC"""
).show(5)



+--------------------+--------------------+-------+
|               model|              min_GB| max_GB|
+--------------------+--------------------+-------+
| TOSHIBA MG07ACA14TA|             13039.0|13039.0|
|       ST12000NM0117|             11176.0|11176.0|
|       ST12000NM0007|-9.31322574615478...|11176.0|
|HGST HUH721212ALN604|-9.31322574615478...|11176.0|
|HGST HUH721212ALE600|             11176.0|11176.0|
+--------------------+--------------------+-------+
only showing top 5 rows




In [13]:
backblaze_2019.groupby(F.col("model")).agg(
    F.min(F.col("capacity_bytes") / F.pow(F.lit(1024), 3)).alias("min_GB"),
    F.max(F.col("capacity_bytes") / F.pow(F.lit(1024), 3)).alias("max_GB"),
).orderBy(F.col("max_GB"), ascending=False).show(5)



+--------------------+--------------------+-------+
|               model|              min_GB| max_GB|
+--------------------+--------------------+-------+
| TOSHIBA MG07ACA14TA|             13039.0|13039.0|
|       ST12000NM0117|             11176.0|11176.0|
|       ST12000NM0007|-9.31322574615478...|11176.0|
|HGST HUH721212ALN604|-9.31322574615478...|11176.0|
|HGST HUH721212ALE600|             11176.0|11176.0|
+--------------------+--------------------+-------+
only showing top 5 rows




#### Filtering after grouping using having

Looking at the results from our query, there are some drives that report more than one capacity. Furthermore, we have some drives that report negative capacity, which is really odd. Let’s focus on seeing how prevalent this is.
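A quick arithmetic check helps read the odd `min_GB` values in the output above. This is a plain-Python sketch, not part of the original notebook. The raw value of `-1` is an assumption inferred from the displayed result, not something read from the data set. The helper name `bytes_to_gib` is hypothetical. Note that dividing by `1024 ** 3` yields binary gigabytes (GiB), even though the columns are labeled GB.

```python
# Bytes in one binary gigabyte (GiB), the divisor used in the queries above.
BYTES_PER_GIB = 1024 ** 3


def bytes_to_gib(capacity_bytes):
    """Convert a raw byte count to binary gigabytes (hypothetical helper)."""
    return capacity_bytes / BYTES_PER_GIB


# Assumption: a drive reporting -1 bytes, for instance as a "missing" marker.
suspicious = bytes_to_gib(-1)
print(suspicious)  # a tiny negative number, on the order of -9.3e-10

# A capacity of exactly 11176 GiB converts back to a whole number.
round_trip = bytes_to_gib(11176 * BYTES_PER_GIB)
print(round_trip)
```

If the assumption holds, the negative capacities are not real measurements but a placeholder value that slipped through the conversion.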

Because of the order in which SQL evaluates its clauses, `where` is always applied before `group by`. What happens if we want to filter on the values of columns created by the `group by` operation? We use a new keyword: `having`!

In [14]:
spark.sql(
    """SELECT
        model,
        min(capacity_bytes / pow(1024, 3)) min_GB,
        max(capacity_bytes/ pow(1024, 3)) max_GB
    FROM backblaze_stats_2019
    GROUP BY 1
    HAVING min_GB != max_GB
    ORDER BY 3 DESC"""
).show(5)



+--------------------+--------------------+-----------------+
|               model|              min_GB|           max_GB|
+--------------------+--------------------+-----------------+
|HGST HUH721212ALN604|-9.31322574615478...|          11176.0|
|       ST12000NM0007|-9.31322574615478...|          11176.0|
|HGST HUH721010ALE600|-9.31322574615478...|           9314.0|
|       ST10000NM0086|-9.31322574615478...|           9314.0|
|        ST8000NM0055|-9.31322574615478...|7452.036460876465|
+--------------------+--------------------+-----------------+
only showing top 5 rows




In [15]:
backblaze_2019.groupby(F.col("model")).agg(
    F.min(F.col("capacity_bytes") / F.pow(F.lit(1024), 3)).alias("min_GB"),
    F.max(F.col("capacity_bytes") / F.pow(F.lit(1024), 3)).alias("max_GB"),
).where(F.col("min_GB") != F.col("max_GB")).orderBy(
    F.col("max_GB"), ascending=False
).show(5)




+--------------------+--------------------+-----------------+
|               model|              min_GB|           max_GB|
+--------------------+--------------------+-----------------+
|HGST HUH721212ALN604|-9.31322574615478...|          11176.0|
|       ST12000NM0007|-9.31322574615478...|          11176.0|
|HGST HUH721010ALE600|-9.31322574615478...|           9314.0|
|       ST10000NM0086|-9.31322574615478...|           9314.0|
|        ST8000NM0055|-9.31322574615478...|7452.036460876465|
+--------------------+--------------------+-----------------+
only showing top 5 rows
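
To see the difference between filtering before and after aggregation in isolation, here is a small sketch that uses Python's built-in `sqlite3` module instead of Spark, so that it runs without a Spark session. The table `drive_stats_toy` and its rows are made up for illustration. The `WHERE`/`HAVING` behavior shown is standard SQL rather than something specific to either engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drive_stats_toy (model TEXT, capacity_gb REAL)")
conn.executemany(
    "INSERT INTO drive_stats_toy VALUES (?, ?)",
    [
        ("A", 100.0),
        ("A", 100.0),
        ("B", -1.0),  # hypothetical bad reading
        ("B", 200.0),
    ],
)

# WHERE filters individual rows before they are grouped.
where_rows = conn.execute(
    """SELECT model, min(capacity_gb), max(capacity_gb)
       FROM drive_stats_toy
       WHERE capacity_gb > 0
       GROUP BY model
       ORDER BY model"""
).fetchall()

# HAVING filters whole groups after the aggregates are computed.
having_rows = conn.execute(
    """SELECT model, min(capacity_gb), max(capacity_gb)
       FROM drive_stats_toy
       GROUP BY model
       HAVING min(capacity_gb) != max(capacity_gb)
       ORDER BY model"""
).fetchall()

print(where_rows)
print(having_rows)
conn.close()
```

With `WHERE`, the bad reading is dropped and both models survive; with `HAVING`, only the model whose minimum and maximum disagree is kept.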




Next, let’s materialize our work, SQL-style.

#### Creating new tables/views using the CREATE keyword

Creating a table or a view is very easy in SQL: prefix the query with `CREATE TABLE` or `CREATE VIEW`.
Let's reproduce the `drive_days` and `failures` views, which hold the number of days of operation for each model and the number of drive failures it has had, respectively. The cell below builds each one twice: once as a SQL view and once as a PySpark data frame.

In [16]:
backblaze_2019.createOrReplaceTempView("drive_stats")

spark.sql(
    """
    CREATE OR REPLACE TEMP VIEW drive_days AS
        SELECT model, count(*) AS drive_days
        FROM drive_stats
        GROUP BY model"""
)

spark.sql(
    """CREATE OR REPLACE TEMP VIEW failures AS
        SELECT model, count(*) AS failures
        FROM drive_stats
        WHERE failure = 1
        GROUP BY model"""
)

drive_days = backblaze_2019.groupby(F.col("model")).agg(
    F.count(F.col("*")).alias("drive_days")
)

failures = (
    backblaze_2019.where(F.col("failure") == 1)
    .groupby(F.col("model"))
    .agg(F.count(F.col("*")).alias("failures"))
)


>Note: ***Creating tables from data in SQL***: You can also create a table directly from data on a hard drive or HDFS by using a modified SQL query. Since we are reading a CSV file, we prefix our path with `csv.`: ``spark.sql("create table q1 as select * from csv.`./data/backblaze/drive_stats_2019_Q1`")``

#### Adding data to our table using UNION and JOIN

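Joins and unions are the only clauses we’ll see that modify the target of our SQL statement, that is, the table the query reads from. In the listing below, `q1` through `q4` are assumed to be data frames holding one quarter of 2019 data each.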

```python
columns_backblaze = ", ".join(q4.columns)

q1.createOrReplaceTempView("Q1")
q2.createOrReplaceTempView("Q2")
q3.createOrReplaceTempView("Q3")
q4.createOrReplaceTempView("Q4")

spark.sql(
    """
    CREATE OR REPLACE TEMP VIEW backblaze_2019 AS
    SELECT {col} FROM Q1 UNION ALL
    SELECT {col} FROM Q2 UNION ALL
    SELECT {col} FROM Q3 UNION ALL
    SELECT {col} FROM Q4
    """.format(col=columns_backblaze)
)

backblaze_2019 = (
    q1.select(q4.columns)
    .union(q2.select(q4.columns))
    .union(q3.select(q4.columns))
    .union(q4)
)
```
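
When the number of views is not fixed, the same `UNION ALL` statement can be assembled from a list of view names. The following is a minimal sketch in plain Python. The helper `build_union_query` is hypothetical, and the column list is only an example. The resulting string would be passed to `spark.sql()` as in the listing above.

```python
def build_union_query(target_view, source_views, columns):
    """Build a CREATE VIEW statement that stacks several views with UNION ALL.

    Hypothetical helper: every source view is assumed to contain all of
    `columns`.
    """
    column_list = ", ".join(columns)
    # One SELECT per source view, all with the same column order.
    selects = [f"SELECT {column_list} FROM {view}" for view in source_views]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE TEMP VIEW {target_view} AS\n{body}"


query = build_union_query(
    "backblaze_2019",
    ["Q1", "Q2", "Q3", "Q4"],
    ["date", "serial_number", "model", "capacity_bytes", "failure"],
)
print(query)
```

Spelling out the column list matters because a union matches columns by position, not by name, so every `SELECT` has to list them in the same order.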

In [17]:
# Joining our tables
spark.sql(
    """select
            drive_days.model,
            drive_days,
            failures
        from drive_days
        left join failures
        on
            drive_days.model = failures.model"""
).show(5)



+-------------+----------+--------+
|        model|drive_days|failures|
+-------------+----------+--------+
|  ST9250315AS|        89|    null|
|  ST4000DM000|   1796728|      72|
|ST12000NM0007|   3212635|     364|
|  ST8000DM005|      2280|       1|
|   ST320LT007|        89|    null|
+-------------+----------+--------+
only showing top 5 rows



In [18]:
drive_days.join(failures, on="model", how="left").show(5)



+-------------+----------+--------+
|        model|drive_days|failures|
+-------------+----------+--------+
|  ST9250315AS|        89|    null|
|  ST4000DM000|   1796728|      72|
|ST12000NM0007|   3212635|     364|
|  ST8000DM005|      2280|       1|
|   ST320LT007|        89|    null|
+-------------+----------+--------+
only showing top 5 rows
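
The `null` values in the `failures` column come from the left join: a model with no recorded failure has no matching row on the right-hand side. The sketch below reproduces this on made-up numbers with Python's built-in `sqlite3` module, used here in place of Spark so the example runs on its own. The tables `days_toy` and `failures_toy` and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE days_toy (model TEXT, drive_days INTEGER)")
conn.execute("CREATE TABLE failures_toy (model TEXT, failures INTEGER)")
conn.executemany("INSERT INTO days_toy VALUES (?, ?)", [("A", 90), ("B", 50)])
# Model "B" never failed, so it has no row in failures_toy.
conn.executemany("INSERT INTO failures_toy VALUES (?, ?)", [("A", 3)])

# LEFT JOIN keeps every row of the left table; missing matches become NULL.
left_rows = conn.execute(
    """SELECT d.model, d.drive_days, f.failures
       FROM days_toy AS d
       LEFT JOIN failures_toy AS f ON d.model = f.model
       ORDER BY d.model"""
).fetchall()

# INNER JOIN keeps only the models present in both tables.
inner_rows = conn.execute(
    """SELECT d.model, d.drive_days, f.failures
       FROM days_toy AS d
       INNER JOIN failures_toy AS f ON d.model = f.model
       ORDER BY d.model"""
).fetchall()

print(left_rows)
print(inner_rows)
conn.close()
```

The choice of join therefore decides whether models without failures stay in the result (with a missing value) or disappear from it.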




#### Organizing SQL code via subqueries

A subquery simply replaces a table name with a standalone SQL query. In the example, we can see that the name of the table has been replaced by the SELECT query that formed the table. We can alias the table referred to in the subquery by adding the name at the end of the statement, after the closing parenthesis.

In [19]:
spark.sql(
"""
SELECT failures.model,
       failures / drive_days failure_rate
FROM   (
        SELECT 
            model,
            count(*) AS drive_days
        FROM   drive_stats
        GROUP  BY model) drive_days
INNER JOIN (
        SELECT 
            model,
            count(*) AS failures
        FROM   drive_stats
        WHERE  failure = 1
        GROUP  BY model) failures
ON drive_days.model = failures.model
ORDER  BY 2 DESC 
"""
).show(5)



+--------------------+--------------------+
|               model|        failure_rate|
+--------------------+--------------------+
|       ST12000NM0117|0.019305019305019305|
|Seagate BarraCuda...|6.341154090044388E-4|
|  TOSHIBA MQ01ABF050|5.579360828423496E-4|
|         ST8000DM005|4.385964912280702E-4|
|          ST500LM030| 4.19639110365086E-4|
+--------------------+--------------------+
only showing top 5 rows




Subqueries are cool but can be hard to read and debug, since they add complexity to the main query. This is where common table expressions, or CTEs, are especially useful. A CTE is a table definition, just like in the subquery case. The difference is that you put CTEs at the top of your main statement (before your main `SELECT`) and prefix them with the word `WITH`. In the next listing, I take the same statement as the subquery case but use two CTEs instead. These can also be considered makeshift `CREATE` statements that get dropped at the end of the query, much like a resource opened with the `with` keyword in Python is released at the end of its block.

In [20]:
spark.sql(
"""
WITH drive_days as (
    SELECT
        model,
        count(*) AS drive_days
    FROM drive_stats
    GROUP BY model),
failures as (
    SELECT
        model,
        count(*) AS failures
    FROM drive_stats
    WHERE failure = 1
    GROUP BY model)

SELECT
    failures.model,
    failures / drive_days failure_rate
FROM drive_days
INNER JOIN failures
ON
    drive_days.model = failures.model
ORDER BY 2 desc
"""
).show(5)



+--------------------+--------------------+
|               model|        failure_rate|
+--------------------+--------------------+
|       ST12000NM0117|0.019305019305019305|
|Seagate BarraCuda...|6.341154090044388E-4|
|  TOSHIBA MQ01ABF050|5.579360828423496E-4|
|         ST8000DM005|4.385964912280702E-4|
|          ST500LM030| 4.19639110365086E-4|
+--------------------+--------------------+
only showing top 5 rows
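
The claim that a CTE disappears once its statement has run can be checked on a toy table. This sketch uses Python's built-in `sqlite3` module rather than Spark, so that it runs without a session. The table `drive_stats_toy`, the CTE names `days` and `fails`, and the rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drive_stats_toy (model TEXT, failure INTEGER)")
conn.executemany(
    "INSERT INTO drive_stats_toy VALUES (?, ?)",
    [("A", 0), ("A", 1), ("A", 0), ("A", 0), ("B", 0), ("B", 0)],
)

# Two CTEs feed the main SELECT. The 1.0 factor forces floating-point
# division in SQLite, which would otherwise divide the integers.
rates = conn.execute(
    """WITH days AS (
           SELECT model, count(*) AS drive_days
           FROM drive_stats_toy
           GROUP BY model),
       fails AS (
           SELECT model, count(*) AS failures
           FROM drive_stats_toy
           WHERE failure = 1
           GROUP BY model)
       SELECT fails.model, 1.0 * failures / drive_days AS failure_rate
       FROM days
       INNER JOIN fails ON days.model = fails.model"""
).fetchall()
print(rates)

# The CTE names only exist inside the statement that defines them.
try:
    conn.execute("SELECT * FROM days")
    cte_persisted = True
except sqlite3.OperationalError:
    cte_persisted = False
print(cte_persisted)
conn.close()
```

Querying `days` after the fact fails because nothing was registered, which is the practical difference from a `CREATE ... VIEW` statement.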




In [21]:
def failure_rate(drive_stats):
    """Return failures / drive_days per model, highest failure rate first."""
    drive_days = drive_stats.groupby(F.col("model")).agg(
        F.count(F.col("*")).alias("drive_days")
    )

    failures = (
        drive_stats.where(F.col("failure") == 1)
        .groupby(F.col("model"))
        .agg(F.count(F.col("*")).alias("failures"))
    )
    answer = (
        drive_days.join(failures, on="model", how="inner")
        .withColumn("failure_rate", F.col("failures") / F.col("drive_days"))
        .orderBy(F.col("failure_rate").desc())
    )

    return answer

failure_rate(backblaze_2019).show(5)



+--------------------+----------+--------+--------------------+
|               model|drive_days|failures|        failure_rate|
+--------------------+----------+--------+--------------------+
|       ST12000NM0117|       259|       5|0.019305019305019305|
|Seagate BarraCuda...|      1577|       1|6.341154090044388E-4|
|  TOSHIBA MQ01ABF050|     44808|      25|5.579360828423496E-4|
|         ST8000DM005|      2280|       1|4.385964912280702E-4|
|          ST500LM030|     21447|       9| 4.19639110365086E-4|
+--------------------+----------+--------+--------------------+
only showing top 5 rows




#### Using Python to increase resiliency and simplify the data reading stage



In [28]:
from functools import reduce

DATA_DIRECTORY = "data/backblaze/"
DATA_FILES = ['2019-09-*.csv']

data = [
    spark.read.csv(DATA_DIRECTORY + file, header=True, inferSchema=True)
    for file in DATA_FILES
]

common_columns = list(
    reduce(lambda x, y: x.intersection(y), [set(df.columns) for df in data])
)

assert set(["model", "capacity_bytes", "date", "failure"]).issubset(
    set(common_columns)
)

full_data = reduce(
    lambda x, y: x.select(common_columns).union(y.select(common_columns)), data
)
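
The `reduce()` call that computes `common_columns` can be tried without Spark. In this sketch, plain lists of column names stand in for the `df.columns` attribute of each data frame. The lists themselves are hypothetical, not the actual Backblaze schemas.

```python
from functools import reduce

# Hypothetical column lists, standing in for `df.columns` of each file.
columns_per_file = [
    ["date", "model", "capacity_bytes", "failure", "smart_1_raw"],
    ["date", "model", "capacity_bytes", "failure", "smart_1_raw", "smart_255_raw"],
    ["date", "model", "capacity_bytes", "failure"],
]

# reduce() folds the list pairwise: (first & second) & third, and so on.
common = reduce(
    lambda x, y: x.intersection(y), [set(cols) for cols in columns_per_file]
)

# Sets are unordered, so sort to get a stable, readable column order.
common_columns = sorted(common)
print(common_columns)

# The same guard as in the cell above: fail early if a key column is missing.
assert {"model", "capacity_bytes", "date", "failure"}.issubset(common_columns)
```

One detail worth knowing: when the list holds a single element, as with the one file pattern in `DATA_FILES`, `reduce()` returns that element unchanged and never calls the lambda.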


#### Using SQL-style expressions in PySpark

In this section, we use SQL-style expressions when appropriate to showcase when it makes sense to fuse both languages. At the end of this block, we have code that
- Selects only the useful columns for our query
- Gets our drive capacity in gigabytes
- Computes the drive_days and failures data frames
- Joins the two data frames into a summarized one and computes the failure rate

In [29]:
full_data = full_data.selectExpr(
    "model", "capacity_bytes / pow(1024, 3) capacity_GB", "date", "failure"
)

drive_days = full_data.groupby("model", "capacity_GB").agg(
    F.count("*").alias("drive_days")
)

failures = (
    full_data.where("failure = 1")
    .groupby("model", "capacity_GB")
    .agg(F.count("*").alias("failures"))
)

summarized_data = (
    drive_days.join(failures, on=["model", "capacity_GB"], how="left")
    .fillna(0.0, ["failures"])
    .selectExpr("model", "capacity_GB", "failures / drive_days failure_rate")
    .cache()
)

`selectExpr()` is just like the `select()` method with the exception that it will process SQL-style operations. This is nice since it removes a bit of syntax when manipulating columns with functions and arithmetic. In our case, the PySpark alternative (displayed in the next listing) is a little more verbose and cumbersome to write and read, especially since we have to create a literal `1024` column to apply the `pow()` function.

```python
full_data = full_data.select(
    F.col("model"),
    (F.col("capacity_bytes") / F.pow(F.lit(1024), 3)).alias("capacity_GB"),
    F.col("date"),
    F.col("failure")
)
```

The second method is simply called `expr()`. It wraps a SQL-style expression into a column. This is kind of a generalized `selectExpr()` that you can use in lieu of `F.col()` (or the column name) when you want to modify a column.
```python
failures = (
    full_data.where("failure = 1")
    .groupby("model", "capacity_GB")
    .agg(F.expr("count(*) failures"))
)
```

The third method is the `where()`/`filter()` method. I find the syntax for filtering in SQL much less verbose than regular PySpark; being able to use the SQL syntax as the argument of the `filter()` method with no ceremony is a godsend. In our final program, I am able to use `full_data.where("failure = 1")` instead of wrapping the column name in `F.col()` like we’ve been doing.

In [30]:
def most_reliable_drive_for_capacity(
    data, capacity_GB=2048, precision=0.25, top_n=3
):
    """Returns the top N drives for a given approximate capacity.

    Given a capacity in GB and a precision as a decimal number, we keep the
    `top_n` drives where:
    - the capacity is between capacity / (1 + precision) and
      capacity * (1 + precision)
    - the failure rate is the lowest
    """
    capacity_min = capacity_GB / (1 + precision)
    capacity_max = capacity_GB * (1 + precision)
    answer = (
        data.where(f"capacity_GB between {capacity_min} and {capacity_max}")
        .orderBy("failure_rate", "capacity_GB", ascending=[True, False])
        .limit(top_n)
    )
    return answer


most_reliable_drive_for_capacity(summarized_data, capacity_GB=11176.0).show()



+--------------------+-----------+--------------------+
|               model|capacity_GB|        failure_rate|
+--------------------+-----------+--------------------+
|HGST HUH721010ALE600|     9314.0|                 0.0|
|HGST HUH721212ALE600|    11176.0|2.136752136752136...|
|HGST HUH721212ALN604|    11176.0|2.150346052118244...|
+--------------------+-----------+--------------------+
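
The capacity window explains why a 9314.0 GB drive appears in a search for 11176.0 GB. The sketch below repeats the arithmetic of `most_reliable_drive_for_capacity()` in plain Python. The helper name `capacity_window` is hypothetical and only isolates the two bounds.

```python
def capacity_window(capacity_gb, precision=0.25):
    """Return the (low, high) bounds used to match a target capacity.

    Mirrors the arithmetic of most_reliable_drive_for_capacity();
    the helper itself is hypothetical.
    """
    low = capacity_gb / (1 + precision)
    high = capacity_gb * (1 + precision)
    return low, high


low, high = capacity_window(11176.0)
print(low, high)

# The 9314.0 GB drive from the output above falls inside this window.
in_window = low <= 9314.0 <= high
print(in_window)
```

The window is not symmetric: the lower bound divides by `1 + precision` instead of multiplying by `1 - precision`, so a precision of 0.25 reaches 20% below the target and 25% above it.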


