# Pyspark installation and start on Apple Silicon

* [Installation guide to pyspark on M1 Mac](https://gist.github.com/brianspiering/1e690b593db025b5acee920fa7330366)
* [Pyspark: Exception: Java gateway process exited before sending the driver its port number](https://stackoverflow.com/a/75391117/4281353)

## Prereauisites

### JDK version
Stick to JDK 8.
```
brew install --cask adoptopenjdk8
```

### Spark Installation with Homebrew

Spark installation is expected to be done via brew as homebrew specific paths are used.

```
brew install scala
brew install apache-spark
```


In [1]:
%%html
<style>
table {float:left}
</style>

In [2]:
%%html
<style>
div.output_area pre {
    white-space: pre;
}
</style>

# PySpark DataFrame Getting Started

* [Spark SQL](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html) (MUST)

## Spark SQL Core Classes

Note that DataFrame is SparkSQL class.


| SparkSession(sparkContext[, jsparkSession]) | The entry point to programming Spark with the Dataset and DataFrame API. |
|:---|:---|
| Catalog(sparkSession) | User-facing catalog API, accessible through SparkSession.catalog. |
| DataFrame(jdf, sql_ctx) | A distributed collection of data grouped into named columns. |
| Column(jc) | A column in a DataFrame. |
| Row | A row in DataFrame. |
| GroupedData(jgd, df) | A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy(). |
| PandasCogroupedOps(gd1, gd2) | A logical grouping of two GroupedData, created by GroupedData.cogroup(). |
| DataFrameNaFunctions(df) | Functionality for working with missing data in DataFrame. |
| DataFrameStatFunctions(df) | Functionality for statistic functions with DataFrame. |
| Window | Utility functions for defining window in DataFrames. |


In [3]:
import os
import sys
import gc
import numpy as np

# Constant

In [4]:
SPARK_HOME = "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec"
JAVA_HOME = '/opt/homebrew/opt/openjdk'

#  Environemnt Variables



## PYTHONPATH

Refer to the **pyspark** modules to load from the ```$SPARK_HOME/python/lib``` in the Spark installation.

* [PySpark Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/install.html)

> Ensure the SPARK_HOME environment variable points to the directory where the tar file has been extracted. Update PYTHONPATH environment variable such that it can find the PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:

```
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
```

Alternatively install **pyspark** with pip or conda locally which installs the Spark runtime libararies (for standalone).

* [Can PySpark work without Spark?](https://stackoverflow.com/questions/51728177/can-pyspark-work-without-spark)

> As of v2.2, executing pip install pyspark will install Spark. If you're going to use Pyspark it's clearly the simplest way to get started. On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars  
> PySpark has a Spark installation installed. If installed through pip3, you can find it with pip3 show pyspark. Ex. for me it is at ~/.local/lib/python3.8/site-packages/pyspark. This is a standalone configuration so it can't be used for managing clusters like a full Spark installation.

## SPARK_HOME

SPARK_HOME must be set to /opt/homebrew/Cellar/apache-spark/3.3.1/**libexec**",  **NOT** /opt/homebrew/Cellar/apache-spark/3.3.1". 

Otherwise **Java gateway process exited before sending its port number** in java_gateway.py


In [5]:
# --------------------------------------------------------------------------------
# Environment Variables
# --------------------------------------------------------------------------------
os.environ['SPARK_HOME'] = SPARK_HOME
os.environ['JAVA_HOME'] = JAVA_HOME
sys.path.extend([
    f"{SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip",
    f"{SPARK_HOME}/python/lib/pyspark.zip",
])


## PySpark package imports

Execute after the PYTHONPATH setup.

## PYSPARK_PYTHON

In [6]:
import pyspark.sql 
from pyspark.sql.types import *
from pyspark.sql.functions import (
    col,
    when,
    avg,
    stddev,
    isnan,
    round,
    to_date,
    date_format,
    from_unixtime,
)

# Data

* [The UC Irvine Machine Learning Repository  - Record Linkage Comparison Patterns Data Set](https://archive.ics.uci.edu/ml/datasets/Record+Linkage+Comparison+Patterns)

The data are pairs of patient records to identify the two records refer to the same patient or not (na-yose in Japanse).It is from the record linkage study performed at a hospital in 2010 analyzing pairs of patient records that were matched according to several different criteria, such as the patient’s name (first and last), address, and birthday. 

Each matching field was assigned a numerical score from 0.0 to 1.0 based on how similar the strings were, and the data was then hand-labeled to identify which pairs represented the same person and which did not. 


| feature | description  |
|:---------|:--------------|
| is_match| if the pair is a match or not (1: match)          |
| cmp_sex | if the gender of the pair is a match (1:match)             |
|         |              |
|         |              |

In [7]:
%%bash
rm -rf ./data/linkage
mkdir -p ./data/linkage
cd ./data/linkage/
curl -L -o donation.zip https://bit.ly/1Aoywaq
unzip -o donation.zip
unzip -o 'block_*.zip'
rm -rf *.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   163  100   163    0     0    617      0 --:--:-- --:--:-- --:--:--   624
100 53.8M  100 53.8M    0     0  6588k      0  0:00:08  0:00:08 --:--:-- 8829k


Archive:  donation.zip
 extracting: block_10.zip            
 extracting: block_1.zip             
 extracting: block_2.zip             
 extracting: block_3.zip             
 extracting: block_4.zip             
 extracting: block_5.zip             
 extracting: block_6.zip             
 extracting: block_7.zip             
 extracting: block_8.zip             
 extracting: block_9.zip             
  inflating: documentation           
  inflating: frequencies.csv         
Archive:  block_3.zip
  inflating: block_3.csv             

Archive:  block_2.zip
  inflating: block_2.csv             

Archive:  block_1.zip
  inflating: block_1.csv             

Archive:  block_5.zip
  inflating: block_5.csv             

Archive:  block_4.zip
  inflating: block_4.csv             

Archive:  block_6.zip
  inflating: block_6.csv             

Archive:  block_10.zip
  inflating: block_10.csv            

Archive:  block_7.zip
  inflating: block_7.csv             

Archive:  block_9.zip
  inflatin


10 archives were successfully processed.


* [New York Times Best Sellers - Hardcover Fiction Best Sellers from 2008 to 2018](https://www.kaggle.com/cmenca/new-york-times-hardcover-fiction-best-sellers)

---
# Spark Session


In [8]:
from pyspark.sql import SparkSession

In [9]:
spark = SparkSession.builder\
    .master('local[*]') \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/02/09 07:31:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


To avoid ```You may get a different result due to the upgrading to Spark >= 3.0: Fail to parse '2008-06-22 10:00:00' in the new parser.```.

* [String to Date migration from Spark 2.0 to 3.0 gives Fail to recognize](https://stackoverflow.com/questions/62602720)

In [49]:
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

DataFrame[key: string, value: string]

---
# Read CSV

* [DataFrameReader/Writer API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output)

* [SparkSQL CSV Files](https://spark.apache.org/docs/latest/sql-data-sources-csv.html)

> Spark SQL provides spark.read().csv("file_name") to read a file or directory of files in CSV format into Spark DataFrame, and dataframe.write().csv("path") to write to a CSV file. Function option() can be used to customize the behavior of reading or writing.

[SparkSession.read()](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html#read--) returns [DataFrameReader](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html) instance which has [option](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#option-java.lang.String-boolean-) method by which we can specify CSV options.

The options are listed in [Data Source Option](https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option)

In [10]:
prev = spark.read.csv("data/linkage")
prev.printSchema()
del prev

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)



## CSV Options

* Schema Inference 
* Null value replacement
* Header handling

In [11]:
parsed = spark.read\
    .option("header", True)\
    .option("nullValue", "?")\
    .option("inferSchema", True)\
    .csv("data/linkage")

parsed.printSchema()
parsed.show(5, truncate=False)



root
 |-- id_1: string (nullable = true)
 |-- id_2: string (nullable = true)
 |-- cmp_fname_c1: string (nullable = true)
 |-- cmp_fname_c2: string (nullable = true)
 |-- cmp_lname_c1: string (nullable = true)
 |-- cmp_lname_c2: string (nullable = true)
 |-- cmp_sex: string (nullable = true)
 |-- cmp_bd: string (nullable = true)
 |-- cmp_bm: string (nullable = true)
 |-- cmp_by: integer (nullable = true)
 |-- cmp_plz: integer (nullable = true)
 |-- is_match: boolean (nullable = true)

+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
|id_1 |id_2 |cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
|3148 |8326 |1           |null        |1           |null        |1      |1     |1     |1     |1      |true    |
|14055|94934|1           |null        |1           |null       

                                                                                

---
# Exploratory Analysis

In [12]:
parsed.count()

5749213

## Estimates

In [13]:
summary = parsed.describe()
summary.show()



23/02/09 07:31:16 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: 1. Title: Record Linkage Comparison Patterns , ?, ?, ?, ?, ?, ?, ?, ?, ?, ?
 Schema: id_1, id_2, cmp_fname_c1, cmp_fname_c2, cmp_lname_c1, cmp_lname_c2, cmp_sex, cmp_bd, cmp_bm, cmp_by, cmp_plz
Expected: id_1 but found: 1. Title: Record Linkage Comparison Patterns 
CSV file: file:///Users/oonisim/home/repository/git/oonisim/spark-programs/PySpark/data/linkage/documentation
23/02/09 07:31:16 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: cmp_fname_c1, cmp_fname_c2, cmp_lname_c1, cmp_lname_c2, cmp_sex, cmp_bd, cmp_bm, cmp_by, cmp_plz, ?, ?
 Schema: id_1, id_2, cmp_fname_c1, cmp_fname_c2, cmp_lname_c1, cmp_lname_c2, cmp_sex, cmp_bd, cmp_bm, cmp_by, cmp_plz
Expected: id_1 but found: cmp_fname_c1
CSV file: file:///Users/oonisim/home/repository/git/oonisim/spark-programs/PySpark/data/linkage/frequencies.csv
+-------+--------------------+-----------------+------------------

                                                                                

In [14]:
matched_summary = parsed.where(col("is_match") == True).describe()
matched_summary.show(5)



23/02/09 07:31:18 WARN CSVHeaderChecker: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 1, schema size: 12
CSV file: file:///Users/oonisim/home/repository/git/oonisim/spark-programs/PySpark/data/linkage/documentation
23/02/09 07:31:18 WARN CSVHeaderChecker: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 9, schema size: 12
CSV file: file:///Users/oonisim/home/repository/git/oonisim/spark-programs/PySpark/data/linkage/frequencies.csv
+-------+-----------------+-----------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+--------------------+-------------------+-------------------+
|summary|             id_1|             id_2|       cmp_fname_c1|       cmp_fname_c2|        cmp_lname_c1|       cmp_lname_c2|            cmp_sex|             cmp_bd|              cmp_bm|             cmp_by|            cmp_plz|
+-------+---

                                                                                

In [15]:
unmatched_summary = parsed.where(col("is_match") == False).describe()
unmatched_summary.show(5)



23/02/09 07:31:23 WARN CSVHeaderChecker: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 1, schema size: 12
CSV file: file:///Users/oonisim/home/repository/git/oonisim/spark-programs/PySpark/data/linkage/documentation
23/02/09 07:31:23 WARN CSVHeaderChecker: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 9, schema size: 12
CSV file: file:///Users/oonisim/home/repository/git/oonisim/spark-programs/PySpark/data/linkage/frequencies.csv
+-------+------------------+------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+------------------+--------------------+
|summary|              id_1|              id_2|       cmp_fname_c1|       cmp_fname_c2|      cmp_lname_c1|       cmp_lname_c2|            cmp_sex|            cmp_bd|            cmp_bm|            cmp_by|             cmp_plz|
+-------+---------

                                                                                

## Rows with missing values

In [16]:
parsed.where(
    col("cmp_fname_c1").isNull() &
    col("cmp_fname_c2").isNull() & 
    (col("is_match") == False)
).show(5)

+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
|17186|64804|        null|        null|         0.5|        null|      1|     0|     0|     1|      0|   false|
|58872|80686|        null|        null|       0.125|        null|      0|     1|     1|     1|      0|   false|
|19093|75754|        null|        null|           1|        null|      1|     0|     0|     0|      0|   false|
|51568|69136|        null|        null|       0.625|        null|      1|     0|     0|     0|      0|   false|
|36952|63401|        null|        null|           0|        null|      1|     1|     1|     1|      0|   false|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--

In [17]:
parsed.where(
    col("cmp_fname_c2").isNull() 
).count()

[Stage 17:=====>                                                   (1 + 8) / 10]

23/02/09 07:31:24 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: ?
 Schema: cmp_fname_c2
Expected: cmp_fname_c2 but found: ?
CSV file: file:///Users/oonisim/home/repository/git/oonisim/spark-programs/PySpark/data/linkage/documentation
23/02/09 07:31:24 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: cmp_lname_c2
 Schema: cmp_fname_c2
Expected: cmp_fname_c2 but found: cmp_lname_c2
CSV file: file:///Users/oonisim/home/repository/git/oonisim/spark-programs/PySpark/data/linkage/frequencies.csv




5645512

---
# DataFrame Structure

## Comparison with HTML

| HTML       | HTML Description             | Spark       | Spark Description                                                                                                                                                                                                                                                                                                                       |   |
|------------|-------------------------|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---|
| ```<table>```    | List of ```<tr>```            | DataFrame   | List of Row, and DataFrame is an alias of type DataSet[Row].                                                                                                                                                                                                                                                                      |   |
| ```<tr>```       | Table row. List of ```<td> ```| Row         | List of typed values.  ```Row``` has the schema of type **StructType** |   |
|            |                         | StructType  | Defines the type of a Row and a list of StructField that defines a field of a Row.                                                                                                                                                                                                                                                |   |
|```<td>```Table  | Table field             | StructField | Specify the DataType of a field of a row.                                                                                                                                                                                                                                                                                         |   |
|            |                         | DataType    |                                                                                                                                                                                                                                                                                                                                   |   |


## DataFrame Schema Definition

Instead of using **Schema Inference**, you can provide the **Schema Definition** for  ```SparkSession.read.schema(schema).csv(file)``` to use when reading the CSV file.


* [Data Types](https://spark.apache.org/docs/latest/sql-ref-datatypes.html#data-types)

```from pyspark.sql.types import *```

| Data type | Value type in Python | API to access or create a data type |  |
|:---|:---|:---|:--|
|ByteType | int or long Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127. | ByteType() |  |
| ShortType | int or long Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767. | ShortType() |  |
| IntegerType | int or long | IntegerType() |  |
| LongType | long Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807.Otherwise, please convert data to decimal.Decimal and use DecimalType. | LongType() |  |
| FloatType | float Note: Numbers will be converted to 4-byte single-precision floating point numbers at runtime. | FloatType() |  |
| DoubleType | float | DoubleType() |  |
| DecimalType | decimal.Decimal | DecimalType() |  |
| StringType | string | StringType() |  |
| BinaryType | bytearray | BinaryType() |  |
| BooleanType | bool | BooleanType() |  |
| TimestampType | datetime.datetime | TimestampType() |  |
| DateType | datetime.date | DateType() |  |
| ArrayType | list, tuple, or array | ArrayType(elementType, [containsNull]) Note:The default value of containsNull is True. |  |
| MapType | dict | MapType(keyType, valueType, [valueContainsNull]) Note:The default value of valueContainsNull is True. |  |
| StructType | list or tuple | StructType(fields) Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |  |
| StructField | The value type in Python of the data type of this field (For example, Int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]) Note: The default value of nullable is True. |  |


In [18]:
schema = StructType([
    StructField("id_1", IntegerType(), False),
    StructField("id_2", StringType(), False),
    StructField("cmp_fname_c1", DoubleType(), False)
])

for element in schema:
    print(element)
    
# spark.read.schema(schema).csv("...")

StructField('id_1', IntegerType(), False)
StructField('id_2', StringType(), False)
StructField('cmp_fname_c1', DoubleType(), False)


## Row

Each row of the DataFrame is an instance of ```pyspark.sql.Row```.

In [19]:
row: pyspark.sql.Row = parsed.first()
row

Row(id_1='3148', id_2='8326', cmp_fname_c1='1', cmp_fname_c2=None, cmp_lname_c1='1', cmp_lname_c2=None, cmp_sex='1', cmp_bd='1', cmp_bm='1', cmp_by=1, cmp_plz=1, is_match=True)

## Column


In [20]:
id_1 = row['id_1']
print(f"id_1 type:{type(id_1)} value: {id_1}")

id_1 type:<class 'str'> value: 3148


---

# Caching the dataframe

The call to ```cache``` indicates that the contents of the DataFrame should be stored in memory the next time it’s computed. Spark defines a few different mechanisms, or StorageLevel values, for persisting data. cache() is shorthand for [DataFrame.persist](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.persist.html)(StorageLevel.MEMORY), which stores the rows as unserialized Java objects (NOT Python objects).

* [class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.StorageLevel.html)

## [RDD Persistence](https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence)

> One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
> 
> You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
> 
> In addition, each persisted RDD can be stored using a different **storage level**, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The ```cache()``` method is a shorthand for using the default storage level, which is **StorageLevel.MEMORY_ONLY** (store deserialized objects in memory). The full set of storage levels is:

In [21]:
parsed.cache()

DataFrame[id_1: string, id_2: string, cmp_fname_c1: string, cmp_fname_c2: string, cmp_lname_c1: string, cmp_lname_c2: string, cmp_sex: string, cmp_bd: string, cmp_bm: string, cmp_by: int, cmp_plz: int, is_match: boolean]

---
# Cast column type

Estimates of the data has string columns which should be numeric. Convert them to numeric.

In [22]:
summary.printSchema()

root
 |-- summary: string (nullable = true)
 |-- id_1: string (nullable = true)
 |-- id_2: string (nullable = true)
 |-- cmp_fname_c1: string (nullable = true)
 |-- cmp_fname_c2: string (nullable = true)
 |-- cmp_lname_c1: string (nullable = true)
 |-- cmp_lname_c2: string (nullable = true)
 |-- cmp_sex: string (nullable = true)
 |-- cmp_bd: string (nullable = true)
 |-- cmp_bm: string (nullable = true)
 |-- cmp_by: string (nullable = true)
 |-- cmp_plz: string (nullable = true)



In [23]:
summary.show(5)

+-------+--------------------+-----------------+--------------------+--------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+
|summary|                id_1|             id_2|        cmp_fname_c1|        cmp_fname_c2|       cmp_lname_c1|      cmp_lname_c2|           cmp_sex|             cmp_bd|            cmp_bm|            cmp_by|            cmp_plz|
+-------+--------------------+-----------------+--------------------+--------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+
|  count|             5749213|          5749158|             5748131|              103701|            5749133|              2465|           5749133|            5748338|           5748338|           5748337|            5736289|
|   mean|   33324.47979999771|66587.42400114964|  0.7129023464249419|   0.900008998936421| 0

In [24]:
for column in summary.columns[1:]:
    summary = summary.withColumn(column, summary[column].cast(DoubleType()))

summary.printSchema()

root
 |-- summary: string (nullable = true)
 |-- id_1: double (nullable = true)
 |-- id_2: double (nullable = true)
 |-- cmp_fname_c1: double (nullable = true)
 |-- cmp_fname_c2: double (nullable = true)
 |-- cmp_lname_c1: double (nullable = true)
 |-- cmp_lname_c2: double (nullable = true)
 |-- cmp_sex: double (nullable = true)
 |-- cmp_bd: double (nullable = true)
 |-- cmp_bm: double (nullable = true)
 |-- cmp_by: double (nullable = true)
 |-- cmp_plz: double (nullable = true)



# Transpose Dataframe

Spark DataFrame does not have transpose method. When the data size is small, convert first to Pandas to transpose.

In [25]:
def transpose(df: pyspark.sql.DataFrame):
    assert df.count() < 1000
    pdf = df.toPandas()
    pdf = pdf.set_index(df.columns[0]).transpose().reset_index()
    pdf = pdf.rename(columns={"index":"field"})
    pdf = pdf.rename_axis(None, axis=1)
    
    transposed: pyspark.sql.Dataframe = spark.createDataFrame(pdf)
    del pdf
    return transposed
   

transpose(summary).show()

  for column, series in pdf.iteritems():
  for column, series in pdf.iteritems():


+------------+---------+-------------------+-------------------+---+-------------------+
|       field|    count|               mean|             stddev|min|                max|
+------------+---------+-------------------+-------------------+---+-------------------+
|        id_1|5749213.0|  33324.47979999771|  23659.86139888655|NaN|             9999.0|
|        id_2|5749158.0|  66587.42400114964|  23620.50188438175|NaN|            99999.0|
|cmp_fname_c1|5748131.0| 0.7129023464249419|0.38875843950829186|NaN|2.68694413843136E-5|
|cmp_fname_c2| 103701.0|  0.900008998936421|0.27133067681523776|NaN|                1.0|
|cmp_lname_c1|5749133.0| 0.3156278513776009|0.33423361373861266|0.0|                1.0|
|cmp_lname_c2|   2465.0|  0.318296744405166| 0.3685373395187368|0.0|                1.0|
|     cmp_sex|5749133.0| 0.9550012294607436| 0.2073014119031234|0.0|                1.0|
|      cmp_bd|5748338.0|0.22446522967751065| 0.4172296957328137|0.0|                1.0|
|      cmp_bm|5748338

---
# DataFrame API

* [DataFrame APIs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#dataframe-apis) (MUST)
* [PySpark and SparkSQL Basics](https://towardsdatascience.com/pyspark-and-sparksql-basics-6cb4bf967e53)



## Conditional filtering on Column

* [Column API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#column-apis)

where() is alias of filter()

In [26]:
parsed.where(col("is_match") == True).limit(5).show()



23/02/09 07:31:29 WARN CSVHeaderChecker: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 1, schema size: 12
CSV file: file:///Users/oonisim/home/repository/git/oonisim/spark-programs/PySpark/data/linkage/documentation
23/02/09 07:31:29 WARN CSVHeaderChecker: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 9, schema size: 12
CSV file: file:///Users/oonisim/home/repository/git/oonisim/spark-programs/PySpark/data/linkage/frequencies.csv
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| 3148| 8326|           1|        null|           1|        null|      1|     1|     1|     1|      1|    true|
|14055|94934|       

                                                                                

In [27]:
parsed.where(col("id_1").isin([3148, 946, 64880])).show()

+-----+-----+------------+------------+------------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|cmp_fname_c1|cmp_fname_c2|      cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+------------+------------+------------------+------------+-------+------+------+------+-------+--------+
| 3148| 8326|           1|        null|                 1|        null|      1|     1|     1|     1|      1|    true|
|  946|71870|           1|        null|                 1|        null|      1|     1|     1|     1|      1|    true|
|64880|71676|           1|        null|                 1|        null|      1|     1|     1|     1|      1|    true|
|  946|61261|           0|        null| 0.111111111111111|        null|      0|     1|     1|     1|      0|   false|
|  946|35374|       0.125|        null|               0.5|        null|      1|     0|     0|     0|      0|   false|
|  946|39254|           0|        null| 0.42857142857142

In [28]:
parsed.where(
    col("cmp_fname_c1").isNotNull() & col("cmp_fname_c2").isNull()
).show()

+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| 3148| 8326|           1|        null|           1|        null|      1|     1|     1|     1|      1|    true|
|14055|94934|           1|        null|           1|        null|      1|     1|     1|     1|      1|    true|
|33948|34740|           1|        null|           1|        null|      1|     1|     1|     1|      1|    true|
|  946|71870|           1|        null|           1|        null|      1|     1|     1|     1|      1|    true|
|64880|71676|           1|        null|           1|        null|      1|     1|     1|     1|      1|    true|
|25739|45991|           1|        null|           1|        null|      1|     1|     1|     1|      1|  

## Conditional Annotation

```WHEN X THEN ... ELSE WHEN Y ... ELSE ...```

In [29]:
parsed.withColumn(
    "oddity",
    when(col("id_1") % 3 == 0, "1st")\
    .when(col("id_1") % 3 == 1, "2nd")\
    .when(col("id_1") % 3 == 2, "3rd")\
    .otherwise("last") 
)\
.select("id_1", "id_2", "is_match", "oddity")\
.show(5)

+-----+-----+--------+------+
| id_1| id_2|is_match|oddity|
+-----+-----+--------+------+
| 3148| 8326|    true|   2nd|
|14055|94934|    true|   1st|
|33948|34740|    true|   1st|
|  946|71870|    true|   2nd|
|64880|71676|    true|   3rd|
+-----+-----+--------+------+
only showing top 5 rows



## Sort

In [30]:
parsed.orderBy("id_1").limit(5).show()

+--------------------+----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
|                id_1|id_2|cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+--------------------+----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
|                 ...|null|        null|        null|        null|        null|   null|  null|  null|  null|   null|    null|
|             achi...|null|        null|        null|        null|        null|   null|  null|  null|  null|   null|    null|
|             link...|null|        null|        null|        null|        null|   null|  null|  null|  null|   null|    null|
|             that...|null|        null|        null|        null|        null|   null|  null|  null|  null|   null|    null|
|          -- A ne...|null|        null|        null|        null|        null|   null|  null|  null|  null|   null|  

## Group By and Aggregation

### Column name reference
Two ways we can reference the names of the columns in the DataFrame: either as literal strings, like in groupBy("is_match"), or as Column objects by using the "col()" function. Need to use the ```col ```function to call the ```desc``` method.

In [31]:
parsed.groupby('is_match').count().orderBy(col("count").desc()).show()

+--------+-------+
|is_match|  count|
+--------+-------+
|   false|5728201|
|    true|  20931|
|    null|     81|
+--------+-------+



In [32]:
parsed.groupby('is_match').agg(
    round(avg("cmp_plz"),2),
    round(stddev("cmp_sex"),2)
)\
.withColumnRenamed("round(avg(cmp_plz), 2)", "cmp_plz_mean")\
.withColumnRenamed("round(stddev_samp(cmp_sex), 2)", "cmp_sex_mean")\
.orderBy(col("is_match").desc()).show()



+--------+------------+------------+
|is_match|cmp_plz_mean|cmp_sex_mean|
+--------+------------+------------+
|    true|        0.96|        0.11|
|   false|         0.0|        0.21|
|    null|        null|        null|
+--------+------------+------------+



                                                                                

In [33]:
parsed\
    .groupby('is_match')\
    .agg({
        "cmp_plz": "avg",
        "cmp_sex": "stddev"
    })\
    .withColumnRenamed("avg(cmp_plz)", "cmp_plz_mean")\
    .orderBy(col("is_match").desc()
).show()

+--------+--------------------+-------------------+
|is_match|        cmp_plz_mean|    stddev(cmp_sex)|
+--------+--------------------+-------------------+
|    true|  0.9584250310975027|0.11201570591216435|
|   false|0.002043781112285135|0.20755988859217375|
|    null|                null|               null|
+--------+--------------------+-------------------+





## Correlation

In [34]:
parsed.corr("cmp_fname_c1", "cmp_fname_c2")

IllegalArgumentException: requirement failed: Currently correlation calculation for columns with dataType string not supported.

## Clearning data

### Remove duplicates

* draop_duplicates()
* distinct()

In [35]:
prased = parsed.drop_duplicates()

### Imputate NA values

In [36]:
parsed = parsed.fillna(0)
parsed.show(3)

+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| 3148| 8326|           1|        null|           1|        null|      1|     1|     1|     1|      1|    true|
|14055|94934|           1|        null|           1|        null|      1|     1|     1|     1|      1|    true|
|33948|34740|           1|        null|           1|        null|      1|     1|     1|     1|      1|    true|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
only showing top 3 rows



---
# SQL

In [37]:
parsed.createOrReplaceTempView("linkage")

In [38]:
query = """
SELECT
    COUNT(is_match) AS cnt,
    ROUND(AVG(cmp_plz),5) AS avg_plz,
    STD(cmp_sex)
FROM linkage
GROUP BY is_match
"""

spark.sql(query).show(5)

+-------+-------+-------------------+
|    cnt|avg_plz|       std(cmp_sex)|
+-------+-------+-------------------+
|  20931| 0.9571|0.11201570591216435|
|5728201|0.00204|0.20755988859217375|
|      0|    0.0|               null|
+-------+-------+-------------------+





## Join

Calculate the diffence of mean values of the fields between matched records and unmatched records to identify the correlations.

In [39]:
matched_summary_transposed = transpose(matched_summary)
matched_summary_transposed.show(3)

unmatched_summary_transposed = transpose(unmatched_summary)
unmatched_summary_transposed.show(3)

matched_summary_transposed.createOrReplaceTempView("matched")
unmatched_summary_transposed.createOrReplaceTempView("unmatched")

+------------+-----+------------------+-------------------+-----+-----+
|       field|count|              mean|             stddev|  min|  max|
+------------+-----+------------------+-------------------+-----+-----+
|        id_1|20931| 34575.72117911232|  21950.31285196913|10001|99946|
|        id_2|20931| 51259.95939037791|  24345.73345377519|10010|99996|
|cmp_fname_c1|20922|0.9973163859635038|0.03650667584833679|    0|    1|
+------------+-----+------------------+-------------------+-----+-----+
only showing top 3 rows

+------------+-------+------------------+-------------------+-----+-----+
|       field|  count|              mean|             stddev|  min|  max|
+------------+-------+------------------+-------------------+-----+-----+
|        id_1|5728201|33319.913548075565| 23665.760130330676|    1| 9999|
|        id_2|5728201| 66643.44259218557| 23599.551728241313|10000|99999|
|cmp_fname_c1|5727203|0.7118634802175091|0.38908060096985553|    0|    1|
+------------+-------+-----

  for column, series in pdf.iteritems():
  for column, series in pdf.iteritems():
  for column, series in pdf.iteritems():
  for column, series in pdf.iteritems():


In [40]:
query = """
SELECT 
    m.count AS matched_count,
    u.count AS unmatch_count,
    m.count + u.count as total,
    ROUND(m.mean - u.mean, 3) as mean_delta
FROM
    matched AS m 
    INNER JOIN unmatched u ON m.field = u.field
WHERE
    m.field NOT IN ('id_1', 'id_2')
ORDER BY 
    mean_delta desc
"""
spark.sql(query).show()

+-------------+-------------+---------+----------+
|matched_count|unmatch_count|    total|mean_delta|
+-------------+-------------+---------+----------+
|        20902|      5715387|5736289.0|     0.956|
|          475|         1989|   2464.0|     0.806|
|        20925|      5727412|5748337.0|     0.776|
|        20925|      5727412|5748337.0|     0.775|
|        20931|      5728201|5749132.0|     0.684|
|        20925|      5727412|5748337.0|     0.511|
|        20922|      5727203|5748125.0|     0.285|
|         1333|       102365| 103698.0|     0.091|
|        20931|      5728201|5749132.0|     0.032|
+-------------+-------------+---------+----------+



---

# Read JSON

* [SparkSQL Guide - JSON Files](https://spark.apache.org/docs/latest/sql-data-sources-json.html)

In [41]:
preview = spark.read\
    .option("compression", "none")\
    .option("inferSchema", True)\
    .json("./data/books/nyt2.json")

preview.printSchema()
del preview

root
 |-- _id: struct (nullable = true)
 |    |-- $oid: string (nullable = true)
 |-- amazon_product_url: string (nullable = true)
 |-- author: string (nullable = true)
 |-- bestsellers_date: struct (nullable = true)
 |    |-- $date: struct (nullable = true)
 |    |    |-- $numberLong: string (nullable = true)
 |-- description: string (nullable = true)
 |-- price: struct (nullable = true)
 |    |-- $numberDouble: string (nullable = true)
 |    |-- $numberInt: string (nullable = true)
 |-- published_date: struct (nullable = true)
 |    |-- $date: struct (nullable = true)
 |    |    |-- $numberLong: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- rank: struct (nullable = true)
 |    |-- $numberInt: string (nullable = true)
 |-- rank_last_week: struct (nullable = true)
 |    |-- $numberInt: string (nullable = true)
 |-- title: string (nullable = true)
 |-- weeks_on_list: struct (nullable = true)
 |    |-- $numberInt: string (nullable = true)



In [43]:
books = spark\
    .read\
    .option("compression", "none")\
    .option("inferSchema", True)\
    .json("./data/books/nyt2.json")\
    .select(
        "title",
        "author",
        col("price.$numberDouble").alias("price"),    
        to_date(
            from_unixtime(col("published_date.$date.$numberLong") / 1000),
            "yyyy-MM-dd"
        ).alias("published"),
        to_date(
            from_unixtime(col("bestsellers_date.$date.$numberLong") / 1000),
            "yyyy-MM-dd"
        ).alias("best_seller_date"),
        col("amazon_product_url").alias("url")
    )
books.printSchema()
books.show(5, truncate=False)

root
 |-- title: string (nullable = true)
 |-- author: string (nullable = true)
 |-- price: string (nullable = true)
 |-- published: date (nullable = true)
 |-- best_seller_date: date (nullable = true)
 |-- url: string (nullable = true)

23/02/09 07:34:30 ERROR Executor: Exception in task 0.0 in stage 65.0 (TID 197)
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading to Spark >= 3.0: Fail to parse '2008-06-08 10:00:00' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failToParseDateTimeInNewParserError(QueryExecutionErrors.scala:1084)
	at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:148)
	at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyO

Py4JJavaError: An error occurred while calling o383.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 65.0 failed 1 times, most recent failure: Lost task 0.0 in stage 65.0 (TID 197) (192.168.1.104 executor driver): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading to Spark >= 3.0: Fail to parse '2008-06-08 10:00:00' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failToParseDateTimeInNewParserError(QueryExecutionErrors.scala:1084)
	at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:148)
	at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.parse(TimestampFormatter.scala:176)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: java.time.format.DateTimeParseException: Text '2008-06-08 10:00:00' could not be parsed, unparsed text found at index 10
	at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2109)
	at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1934)
	at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.parse(TimestampFormatter.scala:168)
	... 17 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2863)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3084)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading to Spark >= 3.0: Fail to parse '2008-06-08 10:00:00' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failToParseDateTimeInNewParserError(QueryExecutionErrors.scala:1084)
	at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:148)
	at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.parse(TimestampFormatter.scala:176)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	... 1 more
Caused by: java.time.format.DateTimeParseException: Text '2008-06-08 10:00:00' could not be parsed, unparsed text found at index 10
	at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2109)
	at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1934)
	at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.parse(TimestampFormatter.scala:168)
	... 17 more


## Extract nested JSON elements

In [50]:
books = spark\
    .read\
    .option("compression", "none")\
    .option("inferSchema", True)\
    .json("./data/books/nyt2.json")\
    .select(
        "title",
        "author",
        col("price.$numberDouble").alias("price"),    
        to_date(
            from_unixtime(col("published_date.$date.$numberLong") / 1000),
            "yyyy-MM-dd"
        ).alias("published"),
        to_date(
            from_unixtime(col("bestsellers_date.$date.$numberLong") / 1000),
            "yyyy-MM-dd"
        ).alias("best_seller_date"),
        col("amazon_product_url").alias("url")
    )
books.printSchema()
books.show(truncate=False)

root
 |-- title: string (nullable = true)
 |-- author: string (nullable = true)
 |-- price: string (nullable = true)
 |-- published: date (nullable = true)
 |-- best_seller_date: date (nullable = true)
 |-- url: string (nullable = true)

+--------------------------------------------------+----------------------------------------+-----+----------+----------------+--------------------------------------------------------------------------------------------------+
|title                                             |author                                  |price|published |best_seller_date|url                                                                                               |
+--------------------------------------------------+----------------------------------------+-----+----------+----------------+--------------------------------------------------------------------------------------------------+
|ODD HOURS                                         |Dean R Koontz                

## Conditional String Filering on Column 

In [51]:
books.where(col("author").like("Alan%")).show(10)

+--------------------+------------+-----+----------+----------------+--------------------+
|               title|      author|price| published|best_seller_date|                 url|
+--------------------+------------+-----+----------+----------------+--------------------+
| THE SPIES OF WARSAW|  Alan Furst| null|2008-06-22|      2008-06-07|http://www.amazon...|
| THE SPIES OF WARSAW|  Alan Furst| null|2008-06-29|      2008-06-14|http://www.amazon...|
| THE SPIES OF WARSAW|  Alan Furst| null|2008-07-06|      2008-06-21|http://www.amazon...|
|SPIES OF THE BALKANS|  Alan Furst| null|2010-07-04|      2010-06-20|http://www.amazon...|
|SPIES OF THE BALKANS|  Alan Furst| null|2010-07-11|      2010-06-27|http://www.amazon...|
|A RED HERRING WIT...|Alan Bradley| null|2011-02-27|      2011-02-12|http://www.amazon...|
|I AM HALF-SICK OF...|Alan Bradley| null|2011-11-20|      2011-11-05|http://www.amazon...|
|    MISSION TO PARIS|  Alan Furst| null|2012-07-01|      2012-06-16|http://www.amazon...|

In [52]:
books.where(col("author").rlike("^Alan [BF].*")).show(10)

+--------------------+------------+-----+----------+----------------+--------------------+
|               title|      author|price| published|best_seller_date|                 url|
+--------------------+------------+-----+----------+----------------+--------------------+
| THE SPIES OF WARSAW|  Alan Furst| null|2008-06-22|      2008-06-07|http://www.amazon...|
| THE SPIES OF WARSAW|  Alan Furst| null|2008-06-29|      2008-06-14|http://www.amazon...|
| THE SPIES OF WARSAW|  Alan Furst| null|2008-07-06|      2008-06-21|http://www.amazon...|
|SPIES OF THE BALKANS|  Alan Furst| null|2010-07-04|      2010-06-20|http://www.amazon...|
|SPIES OF THE BALKANS|  Alan Furst| null|2010-07-11|      2010-06-27|http://www.amazon...|
|A RED HERRING WIT...|Alan Bradley| null|2011-02-27|      2011-02-12|http://www.amazon...|
|I AM HALF-SICK OF...|Alan Bradley| null|2011-11-20|      2011-11-05|http://www.amazon...|
|    MISSION TO PARIS|  Alan Furst| null|2012-07-01|      2012-06-16|http://www.amazon...|

In [53]:
books.where(
    col("author").startswith("Alan") & col("author").endswith("Bradley")
).show(5)

+--------------------+------------+-----+----------+----------------+--------------------+
|               title|      author|price| published|best_seller_date|                 url|
+--------------------+------------+-----+----------+----------------+--------------------+
|A RED HERRING WIT...|Alan Bradley| null|2011-02-27|      2011-02-12|http://www.amazon...|
|I AM HALF-SICK OF...|Alan Bradley| null|2011-11-20|      2011-11-05|http://www.amazon...|
|SPEAKING FROM AMO...|Alan Bradley| null|2013-02-17|      2013-02-02|http://www.amazon...|
|THE DEAD IN THEIR...|Alan Bradley| null|2014-02-02|      2014-01-18|http://www.amazon...|
|AS CHIMNEY SWEEPE...|Alan Bradley| null|2015-01-25|      2015-01-10|http://www.amazon...|
+--------------------+------------+-----+----------+----------------+--------------------+
only showing top 5 rows



---
# Stop Spark Session

In [54]:
spark.stop()



# Cleanup

In [55]:
del spark
gc.collect()

4289