# Working with Structured Data

**Structured data** is data that conforms to a formal structure. In relational database terms, such a structure is referred to as a **schema**, which includes a formal description of tables, fields and relationships.

![Structured data example](https://upload.wikimedia.org/wikipedia/commons/6/67/Data_model_in_UML.png)

*Structured data example: A database schema described by a UML diagram, Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Data_model_in_UML.png)*

In [None]:
import findspark
findspark.init()
import pyspark

## Spark SQL

**Spark SQL** is the Spark module for structured data processing. The name is slightly misleading, because Spark SQL deals not only with the SQL query language but with structured data in general. Unlike the basic Spark RDD API, the interfaces of Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL and the DataFrame API.

We start our Spark application with the creation of a `SparkSession`. It is the entry point to programming Spark with the SQL and DataFrame API, just as a `SparkContext` is the entry point for programming with the RDD API.

In [None]:
spark = pyspark.sql.SparkSession \
                    .builder \
                    .appName("Spark SQL First Example") \
                    .getOrCreate()

As with a `SparkContext`, our program should always end with a clean exit from the `SparkSession`:

In [None]:
spark.stop()

Let's wrap that in a context manager to make sure we don't forget the clean exit when running a bit of Spark code.

In [None]:
from contextlib import contextmanager

@contextmanager
def use_spark_session(appName):
    spark_session = pyspark.sql.SparkSession.builder.appName(appName).getOrCreate()
    try:
        print("starting ", appName)
        yield spark_session
    finally:
        spark_session.stop()
        print("stopping ", appName)

In [None]:
with use_spark_session("Quick Example") as spark:
    df = spark.range(1e10).toDF("id")
    df.show(5)

## Spark DataFrames

A Spark DataFrame is an immutable, distributed collection of data, organized into named columns. It shares many features with RDDs: A DataFrame is _distributed_. It is _immutable_ but can be operated on via _transformations_ that create new DataFrames. Evaluation of transformations is _lazy_.

For the following examples we start an interactive `SparkSession`:

In [None]:
spark = pyspark.sql.SparkSession \
                    .builder \
                    .appName("Spark DataFrame Examples") \
                    .getOrCreate()

We have already used the [`DataFrame.toDF`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=todf#pyspark.sql.DataFrame.toDF) method above. Note that `spark.range` already returns a DataFrame (with a single column named `id`), not an RDD. The method returns a new DataFrame with the given column names:

In [None]:
df = spark.range(1000).toDF("id")

In [None]:
df.show(5)

Let's explore some operations on DataFrames by working with some real data - a table of US zip codes and associated information. It is provided as a [JSON](https://en.wikipedia.org/wiki/JSON) file:

In [None]:
!head ../.assets/data/zipcodes/zips.json

In [None]:
df = spark.read.json("../.assets/data/zipcodes/zips.json")

In [None]:
df

In [None]:
df.show()

A DataFrame is made up of columns and rows and has a schema. A quick look at those components:

**Schema**

The DataFrame's schema defines the names and types of columns. Available data types can be found in `pyspark.sql.types`.

In [None]:
df.schema

In [None]:
df.printSchema()

**Rows**

Get the number of rows (and note that this too is an _action_ that triggers distributed computation):

In [None]:
df.count()

Look at a single row object and get values from its fields:

In [None]:
head_row = df.head()

In [None]:
head_row

In [None]:
head_row["city"]

**Columns**

List all columns:

In [None]:
df.columns

Select a column:

In [None]:
df["pop"]

Give a column an alias. This returns a new `Column` expression and leaves the DataFrame itself unchanged:

In [None]:
df["pop"].alias("population_size")

In [None]:
df

Renaming a column by transformation:

In [None]:
df = df.withColumnRenamed("pop", "population_size")

In [None]:
df.show(5)

In [None]:
df.rdd.filter(lambda row: row["population_size"] > 1e5).collect()

### DataFrame Operations

In the following we go through a couple of frequently needed operations on DataFrames.

**Selecting**


Selecting rows by value in a column:

In [None]:
df[df["city"] == "SPRINGFIELD"].show()

Filtering by value range of a column:

In [None]:
df.filter(df["population_size"] > 1e5).show()

In [None]:
df.filter?

Equivalent, but using a `pandas`-style syntax:

In [None]:
df[df["population_size"] > 1e5].show()

Selecting columns:

In [None]:
df[["_id", "population_size"]].show(5)

**Grouping**

The `groupBy` method returns a special `GroupedData` object rather than a DataFrame. To get a DataFrame back, we need to call a method on it that specifies how to aggregate each group, such as `count` or `sum`.

In [None]:
df.groupBy("state")

In [None]:
df.groupBy("state").count().show(10)

In [None]:
df.groupBy("state").sum().show(10)

**Extending**

Adding a new column with values derived from an existing one:

In [None]:
df.withColumn("population_thousands", df["population_size"] / 1000).show()

### Running SQL Queries

[SQL](https://en.wikipedia.org/wiki/SQL) is a domain-specific language for handling structured data, and is very common in the context of databases. You usually have the choice whether to express operations on Spark DataFrames directly via the API or as SQL statements. This is mainly a matter of preference and familiarity: both forms are turned into query plans by the same optimizer, so neither is inherently faster, although a SQL statement may be more readable depending on the specific case.

SQL queries can be sent to Spark by calling `spark.sql`. Any DataFrame that is to be available as a table to query needs to be registered beforehand.

In [None]:
df.createOrReplaceTempView("zipcodes")

sqlResult = spark.sql("SELECT * FROM zipcodes")
sqlResult.show()

## Exercises: Wrangling the Zipcode DataFrame

a) Output a table of the total population of each state in descending order

In [None]:
# Your turn:

b) Show the US zip code areas north of the 49th parallel with more than 1000 inhabitants.

In [None]:
# Your turn:

### Statistics on DataFrames

Here we briefly discuss how to compute summary statistics on numerical columns of a DataFrame. As expected, the DataFrame has a `describe` method that summarizes the distributions of the values contained:

In [None]:
df[["population_size"]].describe().show()



Let's see that on some randomly generated data:

In [None]:
from pyspark.sql.functions import randn, rand
rand_df = spark.range(1000).withColumn("normal", randn()).withColumn("uniform", rand())
rand_df.show()

In [None]:
rand_df[["normal", "uniform"]].describe().show()

The `DataFrameStatFunctions` object `df.stat` gives access to more statistical measures, such as the Pearson correlation coefficient:

In [None]:
rand_df.stat.corr("uniform", "normal")

### Reading and Writing DataFrames

DataFrames can be read from and written to several common file formats:

**CSV**

The `DataFrame.write.csv` method writes a DataFrame to disk in the CSV format. Note that the target path becomes a directory containing one file per partition:

In [None]:
df[["_id", "city", "population_size", "state"]] \
    .repartition(2) \
    .write \
    .csv("../.assets/temp/zips", mode="overwrite")

In [None]:
!ls ../.assets/temp/zips

Since we are working only with a small example table, let's use Pandas for convenience:

In [None]:
df[["_id", "city", "population_size", "state"]] \
    .toPandas() \
    .to_csv("../.assets/temp/zips.csv", index=False)

In [None]:
!head ../.assets/temp/zips.csv

Conversely, the `SparkSession.read.csv` method reads the file back into a DataFrame. By default, the header line is treated as data and every column is read as a string, so we need to pass `header=True` and a schema to get back the format we want.

In [None]:
spark.read.csv("../.assets/temp/zips.csv")

In [None]:
spark.read.csv("../.assets/temp/zips.csv", 
               header=True, 
               schema="_id STRING, city STRING, population_size INT, state STRING")

**JSON**

In [None]:
spark.read.json("../.assets/data/zipcodes/zips.json")

In [None]:
spark.read.json("../.assets/data/zipcodes/zips.json").write.json("../.assets/temp/zips.json", mode="overwrite")

In [None]:
!ls ../.assets/temp/zips.json

**Parquet**

[Apache Parquet](https://parquet.apache.org/) is an efficient binary format for tabular data. It is a compressed format with a low storage footprint and fast read and write speeds. (If you are interested, read more [on the performance of Parquet](https://tech.blue-yonder.com/efficient-dataframe-storage-with-apache-parquet/)).

In [None]:
parquet_path = "../.assets/temp/zips.parquet"

In [None]:
df.write.parquet(parquet_path, mode="overwrite", compression="gzip")

In [None]:
spark.read.parquet(parquet_path).show(5)

We can also directly query a Parquet file with SQL:

In [None]:
spark.sql(f"SELECT * FROM parquet.`{parquet_path}` ORDER BY population_size DESC").show(5)

### Spark and Pandas

For some projects, we might want to combine Spark and Pandas. We need to be aware that once we switch to `pandas.DataFrame`, data and computation are restricted to a single node of the cluster, and we lose distributed parallelism. Yet, there can be cases in which that is acceptable.

**Differences**

A goal for Spark's DataFrame API is to mimic `pandas` as much as possible. Still, a few minor, syntactic as well as major, conceptual differences remain. We need to keep in mind that a Spark DataFrame is a distributed collection possibly partitioned over many nodes of the cluster, not a random-access data structure that sits neatly in local memory, like a `pandas.DataFrame`. A major difference is that a Spark DataFrame has no index.

It is possible to combine `pandas` and `pyspark` and convert between their DataFrame types:

In [None]:
pandas_df = df.toPandas()

In [None]:
pandas_df.head()

In [None]:
spark_df = spark.createDataFrame(pandas_df)

In [None]:
spark_df.show()

**Efficient conversion with Apache Arrow**

As shown above, conversion to `pandas.DataFrame` works out of the box, but it can be slow for large DataFrames. The reasons for this are quite [technical](https://bryancutler.github.io/toPandas/), having to do with how the Java Virtual Machine, on which Spark is running, exchanges data with a Python process. Without going into the details: When moving large DataFrames between Spark and `pandas`, it is recommended to use the **[Apache Arrow](https://arrow.apache.org/)** engine for columnar data by setting the respective Spark configuration parameter. Compare the running times of the following two examples:

In [None]:
from pyspark.sql.functions import rand
n_samples = 5_000_000  # integer row count for spark.range

In [None]:
%%time
with use_spark_session("Conversion to Pandas") as spark:
    print("using Arrow: ", spark.conf.get("spark.sql.execution.arrow.enabled"))
    df = spark.range(n_samples).toDF("id").withColumn("value", rand())
    pandas_df = df.toPandas()

In [None]:
%%time
with use_spark_session("Conversion to Pandas") as spark:
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    print("using Arrow: ", spark.conf.get("spark.sql.execution.arrow.enabled"))
    df = spark.range(n_samples).toDF("id").withColumn("x", rand())
    pandas_df = df.toPandas()

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_