# Day 10 - How to Connect to Different Data Sources and Sinks
So far, I've connected my queries to two different data source types, CSV and JSON, but there are many more data source formats that Spark can read from and write to, such as Parquet, Avro, XML, JDBC database connections, or Hive tables.

My task for today is to investigate the generic structure of the Spark connectors, which are implemented in the `pyspark.sql.readwriter` module. Since reading and writing data are tasks in every Spark application, maybe I can derive some basic patterns that I can use for reusable code templates.

## Class pyspark.sql.readwriter.[DataFrameReader](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/readwriter.html#DataFrameReader)

Reading data is a data source -> DataFrame transformation. Since there is no input DataFrame for this transformation, I cannot apply it to a DataFrame. Instead, the data read API is bound to the `SparkSession`; the access path is:

`SparkSession.read`

As I've observed several times before, Spark again offers multiple ways to get the same result. There are actually three different ways to read data from, or write data to, external resources.

The first layout sets each option in its own function call. Some options even have their own function, like `format()` and `schema()`. All these option functions are `DataFrameReader` -> `DataFrameReader` transformations, so they can be chained. Only the `load()` function must be the last one in the chain, because it is a `DataFrameReader` -> `DataFrame` transformation. This is the layout I've used so far, because I was not aware of the others.

In [2]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType

spark = SparkSession\
   .builder\
   .getOrCreate()

myOwnCsv = StructType([
    StructField("DEST_COUNTRY_NAME",StringType(),True),
    StructField("ORIGIN_COUNTRY_NAME",StringType(),True),
    StructField("count",LongType(),False)
])

csvDF = spark.read\
    .format("csv")\
    .option("path", "./data/flight-data/2015-summary.csv")\
    .option("header", "true")\
    .schema(myOwnCsv)\
    .load()

# show() returns None, so call it separately to keep the DataFrame in csvDF
csvDF.show(10)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 10 rows



The second layout passes all options as arguments into the `load()` function. This is the most generic layout.

In [3]:
csvDF = spark.read\
    .load(
        format="csv",
        path="./data/flight-data/2015-summary.csv",
        schema=myOwnCsv,
        header=True
    )

# show() returns None, so call it separately to keep the DataFrame in csvDF
csvDF.show(10)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 10 rows



The third layout uses format-specific load functions, e.g. `csv()` to load CSV files. This is the most concise layout and my favourite one from now on. 

pyspark provides the following load functions out of the box:
* `csv()`
* `jdbc()`
* `json()`
* `parquet()`
* `orc()`
* `table()`
* `text()`

By the way: *Parquet* is the default format in Spark, because it is column-oriented, which supports column-wise compression, and its files are splittable. So if I don't specify the `format` option, Spark will use the Parquet format for both read and write operations (unless the `spark.sql.sources.default` setting has been changed).

Each data source format has its own subset of available options, so I have to consult the pyspark documentation to check which options are applicable, optional or mandatory. Nonetheless, one generic pattern is that I must define the `path` option for all file-based formats.
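
Since I'm looking for reusable templates, here is a minimal sketch of how the generic reader layout could be wrapped in one function. The helper name `read_source` and its parameters are my own invention, not part of pyspark; it only relies on the `format()`, `schema()`, `options()` and `load()` calls of the `DataFrameReader`, and it assumes that `spark` is an existing `SparkSession`.

```python
def read_source(spark, fmt, path=None, schema=None, **options):
    """Read a data source into a DataFrame using the generic reader layout.

    Hypothetical helper: `spark` is an existing SparkSession, `fmt` is a
    format name such as "csv" or "json", and `options` are passed through
    unchanged as format-specific reader options.
    """
    reader = spark.read.format(fmt)
    if schema is not None:
        reader = reader.schema(schema)
    if options:
        reader = reader.options(**options)
    # load() ends the chain: DataFrameReader -> DataFrame
    if path is not None:
        return reader.load(path)
    return reader.load()
```

With the schema from above, the CSV example would become `csvDF = read_source(spark, "csv", "./data/flight-data/2015-summary.csv", schema=myOwnCsv, header=True)`. The `path` stays optional because sources like JDBC don't use one.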

In [4]:
csvDF = spark.read\
    .csv(
        path="./data/flight-data/2015-summary.csv",
        header=True,
        schema=myOwnCsv
    )

csvDF.show(10)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 10 rows



## Class pyspark.sql.readwriter.[DataFrameWriter](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter)

Writing data is a DataFrame -> data sink transformation. Therefore the data write API is bound to the `DataFrame`; the access path is:

`DataFrame.write`

The layout variants are nearly the same as for the read API, except that `load()` is replaced by `save()`.
pyspark provides the following format-specific save functions out of the box:
* `csv()`
* `jdbc()`
* `json()`
* `parquet()`
* `orc()`
* `saveAsTable()` or `insertInto()`
* `text()`
    
A further generic write parameter is the *save mode* which specifies the behavior of the save operation when data already exists.
* `append`: Append contents of this DataFrame to existing data.
* `overwrite`: Overwrite existing data.
* `ignore`: Silently ignore this operation if data already exists.
* `error` or `errorifexists` (default case): Throw an exception if data already exists.
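
A minimal sketch of a matching write template, assuming I want to catch a mistyped save mode before Spark is even called. The helper `write_sink` and the constant `SAVE_MODES` are hypothetical names of my own; the set simply mirrors the list above, and the function only uses the `format()`, `mode()`, `options()` and `save()` calls of the `DataFrameWriter`.

```python
# Save modes as listed above (my own constant, not a pyspark object)
SAVE_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_sink(df, fmt, path, mode="errorifexists", **options):
    """Write a DataFrame to a sink using the generic writer layout.

    Hypothetical helper: `df` is an existing DataFrame, `fmt` is a format
    name such as "csv", and `options` are passed through unchanged as
    format-specific writer options.
    """
    if mode not in SAVE_MODES:
        raise ValueError(f"unknown save mode: {mode!r}")
    writer = df.write.format(fmt).mode(mode)
    if options:
        writer = writer.options(**options)
    # save() ends the chain and triggers the actual write
    writer.save(path)
```

For example, `write_sink(csvDF, "csv", "./data/flight-data/2015-output.csv", mode="overwrite", header=True, sep=";")` would write the flight data as semicolon-separated CSV and replace any existing output.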

In [31]:
csvDF.write \
    .csv(
        path="./data/flight-data/2015-output.csv",
        header=True,
        sep=";",
        mode="overwrite",
        encoding="utf-8",
        compression=None
    )

This command creates a folder "2015-output.csv", which represents the DataFrame, and stores one CSV file and one CRC checksum file in it. Values are separated by semicolons, and existing files are overwritten.

I can speed up the write operation as well as later read operations if I partition the data, so that Spark creates one file per partition instead of one big file. Since each file can only be processed by one process at a time, splitting up the data is a prerequisite for parallel processing.

If I want to partition my flight data by destination country, all I have to do is to partition the DataFrame by that column before writing the data to files.

In [8]:
csvDF \
    .write \
    .partitionBy("DEST_COUNTRY_NAME") \
    .csv(
        path="./data/flight-data/2015-output.csv",
        header=True,
        sep=";",
        mode="overwrite",
        encoding="utf-8",
        compression=None
    )

This command creates a folder "2015-output.csv", which represents the DataFrame, with one sub-folder for each partition.

<img src= "./screenshots/day-010/day-010-partitioned-csv-files.png">

## Reading and Writing JSON Files
There is one aspect of JSON files that I should keep in mind. By default, pyspark assumes that a source JSON file is actually [newline-delimited JSON](http://jsonlines.org), i.e. the file contains many JSON objects, each on a single line.

Example:

`{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15}
{"ORIGIN_COUNTRY_NAME":"Croatia","DEST_COUNTRY_NAME":"United States","count":1}
{"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":344}
{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Egypt","count":15}
{"ORIGIN_COUNTRY_NAME":"India","DEST_COUNTRY_NAME":"United States","count":62}`

This simplified version of JSON is optimal for record-by-record processing. Nevertheless, the original JSON format allows files containing one big complex JSON object or an array of nested objects with multiple hierarchy levels.

Example:

`{"customers":[
    {
        "customerId":1,
        "address":{"street":"Reeperbahn 2", "city":"Hamburg", "country":"Germany"},
        "date-of-birth":"1980-12-17",
        "names":{"currentName":"Mayer", "givenName":"Schmitz"}
    },
    {
        "customerId":2,
        "address":{"street":"Aachener Strasse 234", "city":"Cologne", "country":"Germany"},
        "date-of-birth":"1978-06-27",
        "names":{"currentName":"Müller", "givenName":""}
    },
    ... 
    ]
}`

To read such complex JSON files, I need to set the option `multiLine=True` (the default is False).
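
To convince myself why the option matters, here is a small sketch that uses only Python's standard `json` module, not Spark. It parses each line on its own, which is the view a line-delimited reader has of a file, and shows that this works for JSON Lines but fails for a pretty-printed document. The sample strings and the helper `parse_per_line` are made up for illustration, and Spark itself may react differently to lines it cannot parse.

```python
import json

# Two records in newline-delimited JSON: one complete object per line
json_lines = (
    '{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15}\n'
    '{"ORIGIN_COUNTRY_NAME":"Croatia","DEST_COUNTRY_NAME":"United States","count":1}\n'
)

# One JSON document spread over several lines (shortened sample data)
multi_line = """{"customers": [
    {"customerId": 1, "address": {"city": "Hamburg"}},
    {"customerId": 2, "address": {"city": "Cologne"}}
]}"""


def parse_per_line(text):
    """Parse every non-empty line as its own JSON record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Line by line works for JSON Lines
records = parse_per_line(json_lines)

# The same approach fails for the multi-line document,
# because no single line is a complete JSON value
try:
    parse_per_line(multi_line)
    per_line_ok = True
except json.JSONDecodeError:
    per_line_ok = False

# Parsed as a whole, the document is valid JSON
document = json.loads(multi_line)
```

In pyspark, the second kind of file would be read with `spark.read.json(path, multiLine=True)`, so that the whole file is parsed as one document instead of one record per line.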

For more details I can refer to the Spark SQL documentation on 
[JSON files](https://spark.apache.org/docs/latest/sql-data-sources-json.html)

## Reading and Writing Parquet Files
Parquet files include their own schema definition and enforcement, so when I write data to a Parquet file, there is no schema option to set. Reading a Parquet file always takes the schema from the file itself. The only schema-related option I have here is `mergeSchema=True`, which merges the schemas collected from Parquet part-files with divergent schemas.
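
A minimal sketch of a read with schema merging, assuming `spark` is an existing `SparkSession` and `path` points to a folder of Parquet part-files. The function name `read_parquet_merged` is my own and not part of pyspark.

```python
def read_parquet_merged(spark, path):
    """Read a Parquet dataset and merge the schemas of all part-files.

    Hypothetical helper: `spark` is an existing SparkSession and `path`
    is the folder that contains the Parquet part-files.
    """
    return spark.read.option("mergeSchema", "true").parquet(path)
```

Without the option, I would simply call `spark.read.parquet(path)` and leave the schema handling to Spark's defaults.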

For more details I can refer to the Spark SQL documentation on 
[Parquet files](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html)

## Reading and Writing Avro Files
Avro files are supported since Spark 2.4 by the external package `spark-avro`. There is no Avro-specific save or load function like `avro()` yet. Instead, I have to specify the data source format, either with `.format("avro")` or by passing `format="avro"` to `load()` or `save()`.
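
A minimal sketch of what reading and writing Avro could look like with the generic layout. The helper names `read_avro` and `write_avro` are hypothetical, and both assume that the `spark-avro` package has been made available to the Spark session; how to add it depends on the Spark and Scala versions in use.

```python
def read_avro(spark, path):
    """Read Avro files via the generic format name.

    Hypothetical helper: assumes the external spark-avro package is
    available to the SparkSession passed in as `spark`.
    """
    return spark.read.format("avro").load(path)


def write_avro(df, path, mode="errorifexists"):
    """Write a DataFrame as Avro files via the generic format name."""
    df.write.format("avro").mode(mode).save(path)
```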

For more details I can refer to the Spark SQL documentation on 
[Avro files](https://spark.apache.org/docs/latest/sql-data-sources-avro.html)

## JDBC Connections to Databases
The load and save options for JDBC database connections are quite different from those for file formats. Instead of a path and filename, I have to specify at least:

* the JDBC **driver**
* the JDBC connection **url**
* the database table **dbtable** I want to read from or write to

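`jdbcDF.write.jdbc(
    url="jdbc:postgresql://localhost/test?user=fred&password=secret",
    table="schema.table_name",
    # driver class name; the driver jar (e.g. postgresql-9.4.1297.jar) must be on Spark's classpath
    properties={"driver": "org.postgresql.Driver"})`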

`jdbcDF.write \
  .format("jdbc") \
  .option("url", "jdbc:postgresql:dbserver") \
  .option("dbtable", "schema.tablename") \
  .option("user", "username") \
  .option("password", "password") \
  .save()`
  
On day 3 I learned that Spark tries to optimize query execution by pushing down predicates to the sources as much as possible. For databases I can sometimes get even better performance by specifying a **query** option instead of a **dbtable** option. Here I can use the entire native SQL dialect of the database I'm connecting to.
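
Since **dbtable** and **query** are alternatives (Spark does not allow both at the same time), a small sketch of an options builder that enforces this could look as follows. The function `jdbc_read_options` is a hypothetical helper, and the column names in the sample query are made up.

```python
def jdbc_read_options(url, dbtable=None, query=None, **extra):
    """Build the option dict for a JDBC read.

    Hypothetical helper: exactly one of `dbtable` or `query` must be given;
    `extra` takes further options such as user, password or driver.
    """
    if (dbtable is None) == (query is None):
        raise ValueError("specify exactly one of dbtable or query")
    options = {"url": url, **extra}
    if dbtable is not None:
        options["dbtable"] = dbtable
    else:
        options["query"] = query
    return options


# Sample values; col_a and col_b are made-up column names
opts = jdbc_read_options(
    "jdbc:postgresql:dbserver",
    query="SELECT col_a, col_b FROM schema.tablename WHERE col_a > 10",
    user="username",
    password="password",
)
```

The resulting dict can then be passed to the reader with `spark.read.format("jdbc").options(**opts).load()`.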

Another important (even though not mandatory) option is **numPartitions**. Spark is designed for highly parallelised distributed data processing. Therefore I have to limit the number of parallel JDBC connections to a degree the RDBMS can handle. In conjunction with the **numPartitions** option, I can parallelise table reads by partitioning the data so that each process reads a disjoint part of it, by adding the following options.

`.option("numPartitions", number)
.option("partitionColumn", column)
.option("lowerBound", value)
.option("upperBound", value)`

This *read* partitioning is independent of the physical *table* partitioning in the database. However, I will get the best performance if my read partitioning matches the table partitioning; otherwise there will be a performance bottleneck on the database server.
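
To get a feeling for how the four partitioning options work together, here is a simplified pure-Python model of how the value range is cut into one WHERE clause per partition. It is my own approximation for integer columns, not Spark's actual code, and the exact boundaries Spark computes can differ. The function name `partition_predicates` and the column name `id` are made up.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the WHERE clauses of a partitioned JDBC read.

    Simplified model for integer columns: the range between the bounds
    is cut into num_partitions strides of equal width.
    """
    if num_partitions < 2:
        raise ValueError("need at least two partitions")
    stride = (upper_bound - lower_bound) // num_partitions
    if stride < 1:
        raise ValueError("range is too small for this number of partitions")
    # Inner boundaries between neighbouring partitions
    bounds = [lower_bound + stride * i for i in range(1, num_partitions)]
    predicates = []
    for i in range(num_partitions):
        low = bounds[i - 1] if i > 0 else None
        high = bounds[i] if i < num_partitions - 1 else None
        if low is None:
            # First partition also collects NULLs and values below lower_bound
            predicates.append(f"{column} < {high} OR {column} IS NULL")
        elif high is None:
            # Last partition also collects values above upper_bound
            predicates.append(f"{column} >= {low}")
        else:
            predicates.append(f"{column} >= {low} AND {column} < {high}")
    return predicates


preds = partition_predicates("id", 0, 100, 4)
```

The first and last clauses are open-ended, which reflects that `lowerBound` and `upperBound` only decide the partition stride and do not filter rows. Values outside the bounds therefore all end up in the first or last partition, so badly chosen bounds lead to skewed partitions.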

For more details I can refer to the Spark SQL documentation on 
[JDBC connections](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html)

## Reading and Writing Hive Tables
So far I've never worked with Hive tables, so I leave just the link to the Spark SQL documentation on 
[Hive tables](https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html) for later reference.