# PySpark DataFrame Getting Started

In [3]:
%%html
<style>
table {float:left}
</style>

In [67]:
%%html
<style>
div.output_area pre {
    white-space: pre;
}
</style>

In [2]:
import os
import sys
import gc

# Environment Variables

## Hadoop

In [4]:
os.environ['HADOOP_CONF_DIR'] = "/opt/hadoop/hadoop-3.2.2/etc/hadoop"

In [5]:
%%bash
export HADOOP_CONF_DIR="/opt/hadoop/hadoop-3.2.2/etc/hadoop"
ls $HADOOP_CONF_DIR | head -n 5

capacity-scheduler.xml
configuration.xsl
container-executor.cfg
core-site.xml
core-site.xml.48132.2022-02-15@12:29:41~


## PYTHONPATH

Load the **pyspark** modules from ```$SPARK_HOME/python/lib``` in the Spark installation.

* [PySpark Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/install.html)

> Ensure the SPARK_HOME environment variable points to the directory where the tar file has been extracted. Update PYTHONPATH environment variable such that it can find the PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:

```
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
```

Alternatively, install **pyspark** locally with pip or conda, which also installs the Spark runtime libraries (for standalone use).

* [Can PySpark work without Spark?](https://stackoverflow.com/questions/51728177/can-pyspark-work-without-spark)

> As of v2.2, executing pip install pyspark will install Spark. If you're going to use Pyspark it's clearly the simplest way to get started. On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars  
> PySpark has a Spark installation installed. If installed through pip3, you can find it with pip3 show pyspark. Ex. for me it is at ~/.local/lib/python3.8/site-packages/pyspark. This is a standalone configuration so it can't be used for managing clusters like a full Spark installation.

In [6]:
# os.environ['PYTHONPATH'] = "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip:/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
sys.path.extend([
    "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip",
    "/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
])
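The hard-coded paths above can also be derived from the Spark installation directory. The following is a minimal sketch, assuming the archives sit under `python/lib` as the quoted documentation describes. The helper names `spark_python_zips` and `extend_sys_path` are hypothetical, and the fallback path is simply the one used in this notebook.

```python
import glob
import os
import sys


def spark_python_zips(spark_home: str) -> list:
    """Return the sorted .zip archives found under <spark_home>/python/lib."""
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    return sorted(glob.glob(pattern))


def extend_sys_path(spark_home: str) -> list:
    """Append each archive to sys.path once and return the archives found."""
    zips = spark_python_zips(spark_home)
    for path in zips:
        if path not in sys.path:
            sys.path.append(path)
    return zips


# Fall back to the installation path used in this notebook when SPARK_HOME is unset.
extend_sys_path(os.environ.get("SPARK_HOME", "/opt/spark/spark-3.1.2"))
```

Globbing avoids pinning the Py4J version in the notebook, at the cost of silently adding nothing when the directory does not exist.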

## PySpark package imports

Execute after the PYTHONPATH setup.

In [69]:
import pyspark.sql 
from pyspark.sql.types import *
from pyspark.sql.functions import (
    col,
    avg,
    stddev,
    isnan
)

# Data

* [The UC Irvine Machine Learning Repository  - Record Linkage Comparison Patterns Data Set](https://archive.ics.uci.edu/ml/datasets/Record+Linkage+Comparison+Patterns)

The data consists of pairs of patient records, and the task is to identify whether the two records in a pair refer to the same patient (na-yose in Japanese). It comes from a record linkage study performed at a hospital in 2010, which analyzed pairs of patient records that were matched according to several different criteria, such as the patient’s name (first and last), address, and birthday.

Each matching field was assigned a numerical score from 0.0 to 1.0 based on how similar the strings were, and the data was then hand-labeled to identify which pairs represented the same person and which did not. 
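To give a feel for what a 0.0 to 1.0 string similarity score looks like, here is a small sketch using `difflib.SequenceMatcher` from the standard library. This is only an illustration: the comparison function actually used to build the data set is not stated here, and the function name `similarity` and the sample names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Return a score in [0.0, 1.0], or None when either value is missing."""
    if a is None or b is None:
        return None  # a missing field yields a missing score
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(similarity("meier", "meier"))  # identical strings give the maximum score
print(similarity("meier", "meyer"))  # similar strings give a score between 0 and 1
print(similarity(None, "meier"))     # missing input gives None
```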


| feature | description  |
|:---------|:--------------|
| is_match | whether the pair is a match (true: match) |
| cmp_sex | whether the gender of the pair matches (1: match) |

In [17]:
%%bash
mkdir -p ./data/linkage
cd ./data/linkage/
curl -L -o donation.zip https://bit.ly/1Aoywaq
unzip -o donation.zip
unzip -o 'block_*.zip'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   163  100   163    0     0    545      0 --:--:-- --:--:-- --:--:--   543
100 53.8M  100 53.8M    0     0  7688k      0  0:00:07  0:00:07 --:--:-- 10.3M


Archive:  donation.zip
 extracting: block_10.zip            
 extracting: block_1.zip             
 extracting: block_2.zip             
 extracting: block_3.zip             
 extracting: block_4.zip             
 extracting: block_5.zip             
 extracting: block_6.zip             
 extracting: block_7.zip             
 extracting: block_8.zip             
 extracting: block_9.zip             
  inflating: documentation           
  inflating: frequencies.csv         
Archive:  block_9.zip
  inflating: block_9.csv             

Archive:  block_4.zip
  inflating: block_4.csv             

Archive:  block_10.zip
  inflating: block_10.csv            

Archive:  block_1.zip
  inflating: block_1.csv             

Archive:  block_6.zip
  inflating: block_6.csv             

Archive:  block_5.zip
  inflating: block_5.csv             

Archive:  block_2.zip
  inflating: block_2.csv             

Archive:  block_8.zip
  inflating: block_8.csv             

Archive:  block_3.zip
  inflatin


10 archives were successfully processed.


In [20]:
%%bash
cd data/linkage/
hdfs dfs -mkdir -p linkage
hdfs dfs -put -f block_*.csv linkage

rm -rf block_* documentation frequencies.csv

---
# Spark Session


In [21]:
from pyspark.sql import SparkSession

In [22]:
spark = SparkSession.builder\
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config('spark.debug.maxToStringFields', 100) \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()

2022-02-19 09:21:59,289 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-19 09:22:03,324 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [65]:
NUM_CORES = 4
NUM_PARTITIONS = 3

spark.conf.set("spark.sql.shuffle.partitions", NUM_CORES * NUM_PARTITIONS)
spark.conf.set("spark.default.parallelism", NUM_CORES * NUM_PARTITIONS)
spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')

# Read CSV

* [SparkSQL CSV Files](https://spark.apache.org/docs/latest/sql-data-sources-csv.html)

> Spark SQL provides spark.read().csv("file_name") to read a file or directory of files in CSV format into Spark DataFrame, and dataframe.write().csv("path") to write to a CSV file. Function option() can be used to customize the behavior of reading or writing.

[SparkSession.read()](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html#read--) returns a [DataFrameReader](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html) instance, whose [option](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#option-java.lang.String-boolean-) method specifies CSV options.

The options are listed in [Data Source Option](https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option).

In [23]:
prev = spark.read.csv("linkage")
prev.printSchema()
del prev

                                                                                

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)



## CSV Options

* Schema Inference 
* Null value replacement
* Header handling

In [24]:
parsed = spark.read\
    .option("header", "true")\
    .option("nullValue", "?")\
    .option("inferSchema", "true")\
    .csv("linkage")

parsed.printSchema()



root
 |-- id_1: integer (nullable = true)
 |-- id_2: integer (nullable = true)
 |-- cmp_fname_c1: double (nullable = true)
 |-- cmp_fname_c2: double (nullable = true)
 |-- cmp_lname_c1: double (nullable = true)
 |-- cmp_lname_c2: double (nullable = true)
 |-- cmp_sex: integer (nullable = true)
 |-- cmp_bd: integer (nullable = true)
 |-- cmp_bm: integer (nullable = true)
 |-- cmp_by: integer (nullable = true)
 |-- cmp_plz: integer (nullable = true)
 |-- is_match: boolean (nullable = true)
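To make the effect of the `header` and `nullValue` options concrete, the following plain-Python sketch (not Spark) applies the same two ideas with the standard `csv` module: the first line supplies the field names, and every `?` becomes a missing value. The two data lines are made up for illustration, modeled on the rows shown in this notebook, and `read_rows` is a hypothetical helper.

```python
import csv
import io

# Illustrative sample only: a header line plus two made-up records.
SAMPLE = """id_1,id_2,cmp_fname_c1,cmp_fname_c2,is_match
3148,8326,1,?,TRUE
17186,64804,?,?,FALSE
"""


def read_rows(text, null_value="?"):
    """Parse CSV text, using the header for names and mapping null_value to None."""
    reader = csv.DictReader(io.StringIO(text))  # the first line becomes the field names
    return [
        {name: (None if value == null_value else value) for name, value in record.items()}
        for record in reader
    ]


for row in read_rows(SAMPLE):
    print(row)
```

Unlike Spark with `inferSchema`, this sketch leaves every remaining value as a string; type inference is a separate step.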



                                                                                

# Exploratory Analysis

In [25]:
parsed.show(5)

+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| 3148| 8326|         1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|14055|94934|         1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|33948|34740|         1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|  946|71870|         1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|64880|71676|         1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--

In [26]:
parsed.count()

                                                                                

5749132

## Estimates

In [95]:
summary = parsed.describe()
summary.show()



+-------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+
|summary|              id_1|              id_2|      cmp_fname_c1|      cmp_fname_c2|       cmp_lname_c1|       cmp_lname_c2|            cmp_sex|             cmp_bd|             cmp_bm|            cmp_by|            cmp_plz|
+-------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+
|  count|           5749132|           5749132|           5748125|            103698|            5749132|               2464|            5749132|            5748337|            5748337|           5748337|            5736289|
|   mean| 33324.48559643438| 66587.43558331935|0.7129024704425707| 0.900017671890335|0.3156278193076

                                                                                

In [124]:
matched_summary = parsed.where(col("is_match") == "true").describe()
matched_summary.show(5)

+-------+------------------+------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+--------------------+-------------------+------------------+
|summary|              id_1|              id_2|       cmp_fname_c1|       cmp_fname_c2|       cmp_lname_c1|      cmp_lname_c2|            cmp_sex|             cmp_bd|              cmp_bm|             cmp_by|           cmp_plz|
+-------+------------------+------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+--------------------+-------------------+------------------+
|  count|             20931|             20931|              20922|               1333|              20931|               475|              20931|              20925|               20925|              20925|             20902|
|   mean| 34575.72117911232| 51259.95939037791| 0.9973163859635039| 0.9898900320318176| 0.99

In [126]:
unmatched_summary = parsed.where(col("is_match") == "false").describe()
unmatched_summary.show(5)



+-------+------------------+-----------------+------------------+------------------+------------------+-------------------+-------------------+------------------+------------------+-------------------+--------------------+
|summary|              id_1|             id_2|      cmp_fname_c1|      cmp_fname_c2|      cmp_lname_c1|       cmp_lname_c2|            cmp_sex|            cmp_bd|            cmp_bm|             cmp_by|             cmp_plz|
+-------+------------------+-----------------+------------------+------------------+------------------+-------------------+-------------------+------------------+------------------+-------------------+--------------------+
|  count|           5728201|          5728201|           5727203|            102365|           5728201|               1989|            5728201|           5727412|           5727412|            5727412|             5715387|
|   mean|33319.913548075565|66643.44259218557|0.7118634802163704| 0.898847351409032|0.3131380113360652|0.162

                                                                                

## Rows with missing values

In [93]:
parsed.where(
    col("cmp_fname_c1").isNull() &
    col("cmp_fname_c2").isNull() & 
    (col("is_match") == "false")
).show(5)

+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
|17186|64804|        null|        null|         0.5|        null|      1|     0|     0|     1|      0|   false|
|58872|80686|        null|        null|       0.125|        null|      0|     1|     1|     1|      0|   false|
|19093|75754|        null|        null|         1.0|        null|      1|     0|     0|     0|      0|   false|
|51568|69136|        null|        null|       0.625|        null|      1|     0|     0|     0|      0|   false|
|36952|63401|        null|        null|         0.0|        null|      1|     1|     1|     1|      0|   false|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--

In [85]:
parsed.where(
    col("cmp_fname_c2").isNull() 
).count()

5645434
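As a quick consistency check, the null count above and the non-null `count` that `describe()` reported for `cmp_fname_c2` should add up to the total number of rows. The arithmetic below uses only numbers printed earlier in this notebook.

```python
total_rows = 5749132         # parsed.count()
non_null_fname_c2 = 103698   # "count" row of describe() for cmp_fname_c2
null_fname_c2 = 5645434      # rows where cmp_fname_c2 is null

# Null and non-null rows together must cover the whole DataFrame.
assert non_null_fname_c2 + null_fname_c2 == total_rows

null_fraction = null_fname_c2 / total_rows
print(f"{null_fraction:.1%} of cmp_fname_c2 values are missing")
```

With so few values present, `cmp_fname_c2` carries information for only a small share of the pairs.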

---
# DataFrame Structure

## Comparison with HTML

| HTML | HTML Description | Spark | Spark Description |
|------------|-------------------------|-------------|-------------------------|
| ```<table>``` | List of ```<tr>``` | DataFrame | List of Row. In the Scala API, DataFrame is an alias of the type Dataset[Row]. |
| ```<tr>``` | Table row. List of ```<td>``` | Row | List of typed values. ```Row``` has a schema of type **StructType**. |
|            |                         | StructType | Defines the type of a Row as a list of StructField, each of which defines one field of the Row. |
| ```<td>``` | Table field | StructField | Specifies the DataType of a field of a row. |
|            |                         | DataType |  |


## DataFrame Schema Definition

Instead of using **Schema Inference**, you can provide a **Schema Definition** through ```SparkSession.read.schema(schema).csv(file)``` when reading the CSV file.


* [Data Types](https://spark.apache.org/docs/latest/sql-ref-datatypes.html#data-types)

```from pyspark.sql.types import *```

| Data type | Value type in Python | API to access or create a data type |  |
|:---|:---|:---|:--|
|ByteType | int or long Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127. | ByteType() |  |
| ShortType | int or long Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767. | ShortType() |  |
| IntegerType | int or long | IntegerType() |  |
| LongType | long Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType. | LongType() |  |
| FloatType | float Note: Numbers will be converted to 4-byte single-precision floating point numbers at runtime. | FloatType() |  |
| DoubleType | float | DoubleType() |  |
| DecimalType | decimal.Decimal | DecimalType() |  |
| StringType | string | StringType() |  |
| BinaryType | bytearray | BinaryType() |  |
| BooleanType | bool | BooleanType() |  |
| TimestampType | datetime.datetime | TimestampType() |  |
| DateType | datetime.date | DateType() |  |
| ArrayType | list, tuple, or array | ArrayType(elementType, [containsNull]) Note:The default value of containsNull is True. |  |
| MapType | dict | MapType(keyType, valueType, [valueContainsNull]) Note:The default value of valueContainsNull is True. |  |
| StructType | list or tuple | StructType(fields) Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |  |
| StructField | The value type in Python of the data type of this field (For example, Int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]) Note: The default value of nullable is True. |  |


In [27]:
schema = StructType([
    StructField("id_1", IntegerType(), False),
    StructField("id_2", StringType(), False),
    StructField("cmp_fname_c1", DoubleType(), False)
])

for element in schema:
    print(element)
    
# spark.read.schema(schema).csv("...")

StructField(id_1,IntegerType,false)
StructField(id_2,StringType,false)
StructField(cmp_fname_c1,DoubleType,false)
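`DataFrameReader.schema` also accepts a DDL-formatted string instead of a `StructType`. The sketch below builds such a string for all twelve columns, using the column names and types from the inferred schema printed earlier. The helper `to_ddl` and the `COLUMNS` list are hypothetical, and the read call is left commented out because it needs the running Spark session.

```python
# Column names and types copied from the inferred schema shown earlier.
COLUMNS = [
    ("id_1", "INT"),
    ("id_2", "INT"),
    ("cmp_fname_c1", "DOUBLE"),
    ("cmp_fname_c2", "DOUBLE"),
    ("cmp_lname_c1", "DOUBLE"),
    ("cmp_lname_c2", "DOUBLE"),
    ("cmp_sex", "INT"),
    ("cmp_bd", "INT"),
    ("cmp_bm", "INT"),
    ("cmp_by", "INT"),
    ("cmp_plz", "INT"),
    ("is_match", "BOOLEAN"),
]


def to_ddl(columns):
    """Join (name, type) pairs into a DDL string such as 'id_1 INT, id_2 INT'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


ddl = to_ddl(COLUMNS)
print(ddl)

# Assumed usage with the session and options from this notebook:
# parsed = spark.read.schema(ddl).option("header", "true").option("nullValue", "?").csv("linkage")
```

Supplying the schema up front spares Spark the extra pass over the data that `inferSchema` requires.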


## Row

Each row of the DataFrame is an instance of ```pyspark.sql.Row```.

In [28]:
row: pyspark.sql.Row = parsed.first()
row

Row(id_1=3148, id_2=8326, cmp_fname_c1=1.0, cmp_fname_c2=None, cmp_lname_c1=1.0, cmp_lname_c2=None, cmp_sex=1, cmp_bd=1, cmp_bm=1, cmp_by=1, cmp_plz=1, is_match=True)

## Column


In [29]:
id_1 = row['id_1']
print(f"id_1 type:{type(id_1)} value: {id_1}")

id_1 type:<class 'int'> value: 3148


---

# Caching the dataframe

Calling ```cache``` marks the DataFrame so that its contents are stored the next time it is computed. Spark defines several mechanisms, or StorageLevel values, for persisting data. For a DataFrame, cache() is shorthand for [DataFrame.persist](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.persist.html)() with the default storage level, which keeps partitions in memory and spills them to disk when they do not fit (unlike RDD.cache(), whose default is StorageLevel.MEMORY_ONLY). The cached data lives in the JVM executors, NOT as Python objects.

* [class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.StorageLevel.html)

## [RDD Persistence](https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence)

> One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
> 
> You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
> 
> In addition, each persisted RDD can be stored using a different **storage level**, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The ```cache()``` method is a shorthand for using the default storage level, which is **StorageLevel.MEMORY_ONLY** (store deserialized objects in memory).

In [30]:
parsed.cache()

DataFrame[id_1: int, id_2: int, cmp_fname_c1: double, cmp_fname_c2: double, cmp_lname_c1: double, cmp_lname_c2: double, cmp_sex: int, cmp_bd: int, cmp_bm: int, cmp_by: int, cmp_plz: int, is_match: boolean]
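Note that `cache()` returned immediately: it only marks the DataFrame, and the data is stored when the next action computes it. The toy class below models that lazy behaviour in plain Python. It is not Spark code, and `LazyDataset` with its methods is entirely hypothetical.

```python
class LazyDataset:
    """Toy model of lazy caching: cache() only sets a flag, the first action fills the cache."""

    def __init__(self, compute):
        self._compute = compute          # function that produces the data
        self._cache_requested = False
        self._cached = None
        self.computations = 0            # how many times the data was recomputed

    def cache(self):
        self._cache_requested = True     # nothing is computed here
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached          # served from the cache
        self.computations += 1
        result = self._compute()
        if self._cache_requested:
            self._cached = result        # stored on the first action after cache()
        return result


ds = LazyDataset(lambda: [1, 2, 3])
ds.collect()
ds.collect()
print(ds.computations)  # recomputed on every action while not cached
ds.cache()
ds.collect()
ds.collect()
print(ds.computations)  # only one more computation after cache()
```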

---
# Cast column type

The summary DataFrame returned by ```describe()``` stores every statistic as a string, although the values are numeric. Cast these columns to a numeric type.

In [97]:
summary.printSchema()

root
 |-- summary: string (nullable = true)
 |-- id_1: string (nullable = true)
 |-- id_2: string (nullable = true)
 |-- cmp_fname_c1: string (nullable = true)
 |-- cmp_fname_c2: string (nullable = true)
 |-- cmp_lname_c1: string (nullable = true)
 |-- cmp_lname_c2: string (nullable = true)
 |-- cmp_sex: string (nullable = true)
 |-- cmp_bd: string (nullable = true)
 |-- cmp_bm: string (nullable = true)
 |-- cmp_by: string (nullable = true)
 |-- cmp_plz: string (nullable = true)



In [103]:
summary.show(5)

+-------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+
|summary|              id_1|              id_2|      cmp_fname_c1|      cmp_fname_c2|       cmp_lname_c1|       cmp_lname_c2|            cmp_sex|             cmp_bd|             cmp_bm|            cmp_by|            cmp_plz|
+-------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+
|  count|           5749132|           5749132|           5748125|            103698|            5749132|               2464|            5749132|            5748337|            5748337|           5748337|            5736289|
|   mean| 33324.48559643438| 66587.43558331935|0.7129024704425707| 0.900017671890335|0.3156278193076

In [112]:
for column in summary.columns[1:]:
    summary = summary.withColumn(column, summary[column].cast(DoubleType()))

summary.printSchema()

root
 |-- summary: string (nullable = true)
 |-- id_1: double (nullable = true)
 |-- id_2: double (nullable = true)
 |-- cmp_fname_c1: double (nullable = true)
 |-- cmp_fname_c2: double (nullable = true)
 |-- cmp_lname_c1: double (nullable = true)
 |-- cmp_lname_c2: double (nullable = true)
 |-- cmp_sex: double (nullable = true)
 |-- cmp_bd: double (nullable = true)
 |-- cmp_bm: double (nullable = true)
 |-- cmp_by: double (nullable = true)
 |-- cmp_plz: double (nullable = true)



# Transpose Dataframe

A Spark DataFrame does not have a transpose method. When the data is small, convert it to a Pandas DataFrame first and transpose it there.

In [121]:
def transpose(df: pyspark.sql.DataFrame):
    assert df.count() < 1000
    pdf = df.toPandas()
    pdf = pdf.set_index(df.columns[0]).transpose().reset_index()
    pdf = pdf.rename(columns={"index":"field"})
    pdf = pdf.rename_axis(None, axis=1)
    
    transposed: pyspark.sql.DataFrame = spark.createDataFrame(pdf)
    del pdf
    return transposed
   

transpose(summary).show()


+------------+---------+-------------------+-------------------+---+--------+
|       field|    count|               mean|             stddev|min|     max|
+------------+---------+-------------------+-------------------+---+--------+
|        id_1|5749132.0|  33324.48559643438| 23659.859374487987|1.0| 99980.0|
|        id_2|5749132.0|  66587.43558331935| 23620.487613269706|6.0|100000.0|
|cmp_fname_c1|5748125.0| 0.7129024704425707| 0.3887583596162788|0.0|     1.0|
|cmp_fname_c2| 103698.0|  0.900017671890335| 0.2713176105782331|0.0|     1.0|
|cmp_lname_c1|5749132.0|0.31562781930763056| 0.3342336339615803|0.0|     1.0|
|cmp_lname_c2|   2464.0|0.31841283153174366| 0.3685670662006655|0.0|     1.0|
|     cmp_sex|5749132.0|  0.955001381078048|0.20730111116897443|0.0|     1.0|
|      cmp_bd|5748337.0|0.22446526708507172|0.41722972238461925|0.0|     1.0|
|      cmp_bm|5748337.0|0.48885529849763504| 0.4998758236779003|0.0|     1.0|
|      cmp_by|5748337.0| 0.2227485966810923| 0.4160909629831711|
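The reshaping that `transpose` performs can be sketched without Pandas or Spark: the first column (`summary`) supplies the new column names, and every other column becomes one row. The function `transpose_rows` is hypothetical, and the sample values are taken from the summary shown above.

```python
def transpose_rows(header, rows):
    """Turn statistic-per-row data into field-per-row data.

    header: column names, the first one naming the statistic column.
    rows:   tuples whose first item is the statistic name.
    """
    stat_names = [row[0] for row in rows]
    # zip(*...) regroups the values so that each tuple belongs to one field.
    per_field = zip(*(row[1:] for row in rows))
    return [
        dict(zip(["field"] + stat_names, (field,) + values))
        for field, values in zip(header[1:], per_field)
    ]


header = ["summary", "id_1", "cmp_fname_c1"]
rows = [
    ("count", 5749132.0, 5748125.0),
    ("mean", 33324.48559643438, 0.7129024704425707),
]
for record in transpose_rows(header, rows):
    print(record)
```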

---
# Group By and Aggregation

## Column name reference
There are two ways to reference column names in a DataFrame: as literal strings, as in groupBy("is_match"), or as Column objects created with the ```col()``` function. Calling a Column method such as ```desc``` requires a Column object, so use ```col``` there.

In [37]:
parsed.groupby('is_match').count().orderBy(col("count").desc()).show()

                                                                                

+--------+-------+
|is_match|  count|
+--------+-------+
|   false|5728201|
|    true|  20931|
+--------+-------+



In [41]:
parsed.groupby('is_match').agg(
    avg("cmp_plz"),
    stddev("cmp_sex")
).orderBy(col("is_match").desc()).show()



+--------+--------------------+--------------------+
|is_match|        avg(cmp_plz)|stddev_samp(cmp_sex)|
+--------+--------------------+--------------------+
|    true|  0.9584250310975027|  0.1120157059121644|
|   false|0.002043781112285135| 0.20755988859217647|
+--------+--------------------+--------------------+
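What `groupby(...).agg(avg(...), stddev(...))` computes can be sketched in plain Python: rows are bucketed by the key, and each aggregate skips missing values. The rows below are made up for illustration, and `group_stats` is a hypothetical helper, not a Spark API.

```python
import statistics
from collections import defaultdict

# Made-up rows for illustration only.
rows = [
    {"is_match": True, "cmp_plz": 1, "cmp_sex": 1},
    {"is_match": True, "cmp_plz": 1, "cmp_sex": 1},
    {"is_match": True, "cmp_plz": None, "cmp_sex": 0},
    {"is_match": False, "cmp_plz": 0, "cmp_sex": 1},
    {"is_match": False, "cmp_plz": 0, "cmp_sex": 0},
]


def group_stats(rows, key, avg_col, std_col):
    """Group rows by key, then compute the row count, an average and a sample stddev."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = {}
    for value, members in groups.items():
        avg_values = [m[avg_col] for m in members if m[avg_col] is not None]
        std_values = [m[std_col] for m in members if m[std_col] is not None]
        result[value] = {
            "count": len(members),                                   # counts rows, nulls included
            f"avg({avg_col})": statistics.mean(avg_values),          # nulls skipped
            f"stddev_samp({std_col})": statistics.stdev(std_values), # sample stddev, nulls skipped
        }
    return result


for group, stats in group_stats(rows, "is_match", "cmp_plz", "cmp_sex").items():
    print(group, stats)
```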





---
# SparkSQL

In [43]:
parsed.createOrReplaceTempView("linkage")

In [62]:
query = """
SELECT
    COUNT(is_match) AS cnt,
    ROUND(AVG(cmp_plz),5) AS avg_plz,
    STD(cmp_sex)
FROM linkage
GROUP BY is_match
"""

spark.sql(query).show(5)



+-------+-------+----------------------------+
|    cnt|avg_plz|std(CAST(cmp_sex AS DOUBLE))|
+-------+-------+----------------------------+
|  20931|0.95843|          0.1120157059121644|
|5728201|0.00204|         0.20755988859217644|
+-------+-------+----------------------------+



## Join

Calculate the difference between the mean values of each field for matched and unmatched records to identify which fields best separate matches from non-matches.

In [129]:
matched_summary_transposed = transpose(matched_summary)
matched_summary_transposed.show(3)

unmatched_summary_transposed = transpose(unmatched_summary)
unmatched_summary_transposed.show(3)

matched_summary_transposed.createOrReplaceTempView("matched")
unmatched_summary_transposed.createOrReplaceTempView("unmatched")

+------------+-----+------------------+-------------------+---+-----+
|       field|count|              mean|             stddev|min|  max|
+------------+-----+------------------+-------------------+---+-----+
|        id_1|20931| 34575.72117911232| 21950.312851969127|  5|99946|
|        id_2|20931| 51259.95939037791| 24345.733453775203|  6|99996|
|cmp_fname_c1|20922|0.9973163859635039|0.03650667584833678|0.0|  1.0|
+------------+-----+------------------+-------------------+---+-----+
only showing top 3 rows

+------------+-------+------------------+-----------------+---+------+
|       field|  count|              mean|           stddev|min|   max|
+------------+-------+------------------+-----------------+---+------+
|        id_1|5728201|33319.913548075565|23665.76013033079|  1| 99980|
|        id_2|5728201| 66643.44259218557|23599.55172824128| 30|100000|
|cmp_fname_c1|5727203|0.7118634802163704|0.389080600969852|0.0|   1.0|
+------------+-------+------------------+-----------------+

In [142]:
query = """
SELECT 
    m.count AS matched_count,
    u.count AS unmatch_count,
    m.count + u.count as total,
    ROUND(m.mean - u.mean, 3) as mean_delta
FROM
    matched AS m 
    INNER JOIN unmatched u ON m.field = u.field
WHERE
    m.field NOT IN ('id_1', 'id_2')
ORDER BY 
    mean_delta desc
"""
spark.sql(query).show()

+-------------+-------------+---------+----------+
|matched_count|unmatch_count|    total|mean_delta|
+-------------+-------------+---------+----------+
|        20902|      5715387|5736289.0|     0.956|
|          475|         1989|   2464.0|     0.806|
|        20925|      5727412|5748337.0|     0.776|
|        20925|      5727412|5748337.0|     0.775|
|        20931|      5728201|5749132.0|     0.684|
|        20925|      5727412|5748337.0|     0.511|
|        20922|      5727203|5748125.0|     0.285|
|         1333|       102365| 103698.0|     0.091|
|        20931|      5728201|5749132.0|     0.032|
+-------------+-------------+---------+----------+
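The result above does not show which field each row belongs to. The sketch below reruns the same join for two fields whose means were printed earlier in this notebook (`cmp_plz` and `cmp_fname_c1`) and selects the field name as well. It uses the standard library's in-memory SQLite rather than Spark, so it only illustrates the join logic; the table layout is a simplified assumption with just the `field` and `mean` columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matched (field TEXT, mean REAL);
CREATE TABLE unmatched (field TEXT, mean REAL);
-- Mean values copied from the outputs shown earlier in this notebook.
INSERT INTO matched VALUES
    ('cmp_fname_c1', 0.9973163859635039),
    ('cmp_plz', 0.9584250310975027);
INSERT INTO unmatched VALUES
    ('cmp_fname_c1', 0.7118634802163704),
    ('cmp_plz', 0.002043781112285135);
""")

deltas = conn.execute("""
SELECT
    m.field,
    ROUND(m.mean - u.mean, 3) AS mean_delta
FROM
    matched AS m
    INNER JOIN unmatched AS u ON m.field = u.field
ORDER BY
    mean_delta DESC
""").fetchall()
conn.close()

for field, mean_delta in deltas:
    print(field, mean_delta)
```

The two deltas agree with the 0.956 and 0.285 rows in the Spark output, which identifies those rows as `cmp_plz` and `cmp_fname_c1`.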



---
# Stop Spark Session

In [143]:
spark.stop()



# Cleanup

In [144]:
del spark
gc.collect()

2181