In [1]:
import findspark
findspark.init('/home/purvil/spark-2.4.3-bin-hadoop2.7')

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Aggregation').getOrCreate()

* Spark's core data sources
    - CSV, JSON, Parquet, ORC, JDBC/ODBC, plain-text files
* Cassandra, MongoDB, HBase, etc. are also supported

* Read API
```
DataFrameReader.format(...).option("key", "value").schema(...).load()
```
- Default format is Parquet

* We can get a DataFrameReader using `spark.read`

```
(spark.read.format("csv")
    .option("mode", "FAILFAST")
    .option("inferSchema", "true")
    .option("path", "path/to/file(s)")
    .schema(someSchema)
    .load())
```
* `mode`: what Spark does when it encounters malformed records.
    - `permissive`: default; sets all malformed fields to null
    - `dropMalformed`: drops rows that contain malformed records
    - `failFast`: fails immediately when a malformed record is found
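
The three modes can be modelled in plain Python. This is only a sketch of the behaviour listed above, not Spark's parser: `parse_rows` and the sample lines are hypothetical, and a non-integer count stands in for "malformed".

```python
def parse_rows(lines, mode="permissive"):
    """Parse 'name,count' lines, treating a non-integer count as malformed."""
    mode = mode.lower()
    if mode not in {"permissive", "dropmalformed", "failfast"}:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for line in lines:
        name, raw_count = line.split(",")
        try:
            rows.append((name, int(raw_count)))
        except ValueError:
            if mode == "permissive":
                # Keep the row, but null out the field that failed to parse.
                rows.append((name, None))
            elif mode == "failfast":
                # Abort on the first malformed record.
                raise ValueError(f"malformed record: {line!r}")
            # dropmalformed: skip the whole row and carry on.
    return rows


sample = ["Egypt,24", "India,sixty-nine", "Ireland,264"]
permissive_rows = parse_rows(sample, "permissive")
dropped_rows = parse_rows(sample, "dropMalformed")
```

Calling `parse_rows(sample, "failFast")` raises on the second line instead of returning anything.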


* Write data

```
DataFrameWriter.format(...).option(...).partitionBy(...).bucketBy(...).sortBy(...).save()
```

* The default format for writing is Parquet.
* We can get a DataFrameWriter using `dataframe.write`:
```
dataframe.write.format("csv").mode("overwrite").option("dateFormat", "yyyy-MM-dd").option("path", "path to file").save()
```
* Save modes are
    * `append`: appends the output files to the list of files that already exist at that location
    * `overwrite`: completely overwrites any data that already exists there
    * `errorIfExists`: throws an error and fails the write if data or files already exist at the location (default)
    * `ignore`: if data or files exist at the location, does nothing
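
A minimal plain-Python model of the four save modes, using a local directory as the output location. The `save` helper and the part-file naming are assumptions made for illustration; Spark's own writer is more involved.

```python
import os
import shutil
import tempfile


def save(lines, path, mode="errorifexists"):
    """Write lines as a new part file under directory `path`, honouring the save mode."""
    mode = mode.lower()
    if mode not in {"append", "overwrite", "errorifexists", "ignore"}:
        raise ValueError(f"unknown save mode: {mode}")
    if os.path.exists(path):
        if mode == "errorifexists":
            raise FileExistsError(path)
        if mode == "ignore":
            return  # leave the existing data untouched
        if mode == "overwrite":
            shutil.rmtree(path)  # discard everything already there
    os.makedirs(path, exist_ok=True)
    # "append" (or a fresh directory) simply adds one more part file.
    part = os.path.join(path, f"part-{len(os.listdir(path)):05d}.txt")
    with open(part, "w") as handle:
        handle.write("\n".join(lines))


target = os.path.join(tempfile.mkdtemp(), "out")
save(["a"], target)                # first write succeeds
save(["b"], target, "append")      # now two part files
files_after_append = len(os.listdir(target))
save(["c"], target, "ignore")      # nothing changes
files_after_ignore = len(os.listdir(target))
save(["d"], target, "overwrite")   # back to a single part file
files_after_overwrite = len(os.listdir(target))
```

A further `save(["e"], target)` with the default mode raises `FileExistsError`, which mirrors why the notebook cells below pass `mode("overwrite")` when they are re-run.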

### CSV

![](images/csv1.png)
![](images/csv2.png)

* Reads are lazy: if the data does not fit the specified schema, the failure surfaces at job execution time, not at DataFrame definition time. A path that does not exist is typically reported earlier, when `load()` is called.

In [11]:
csvFile = spark.read.format("csv")\
    .option("header", "true")\
    .option("mode", "FAILFAST")\
    .option("inferSchema", "true")\
    .load("spark_data/flight-data/csv/2015-summary.csv")

In [12]:
csvFile.write.format("csv").mode("overwrite").option("sep", "\t").save("my-tsv-file.tsv")

In [14]:
!ls my-tsv-file.tsv/

part-00000-8c9c7127-f960-440f-9363-fcef7ea7af1f-c000.csv  _SUCCESS


* The number of files in the output folder equals the number of partitions the DataFrame has at the time of writing.

### JSON

* In Spark we generally use line-delimited JSON files, with one JSON object per line.
* The other layout is a single large JSON object or array per file.
* When the `multiLine` option is true, Spark reads an entire file as one JSON document.
* Line-delimited JSON allows new records to be appended.
![](images/json1.png)
![](images/json2.png)
![](images/json3.png)
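
The difference between the two layouts can be seen with the standard `json` module alone. This is a sketch with hand-written sample records taken from the flight data shown in these notes; no Spark is involved.

```python
import json

records = [
    {"DEST_COUNTRY_NAME": "United States", "ORIGIN_COUNTRY_NAME": "Romania", "count": 1},
    {"DEST_COUNTRY_NAME": "United States", "ORIGIN_COUNTRY_NAME": "Ireland", "count": 264},
]

# Line-delimited: one complete JSON object per line.
line_delimited = "\n".join(json.dumps(record) for record in records)

# Appending a record only adds a line; the existing lines are untouched.
new_record = {"DEST_COUNTRY_NAME": "Egypt", "ORIGIN_COUNTRY_NAME": "United States", "count": 24}
line_delimited += "\n" + json.dumps(new_record)
parsed_lines = [json.loads(line) for line in line_delimited.splitlines()]

# Multiline: the whole file is a single JSON value (here an array),
# so it has to be parsed in one go and cannot be extended by appending text.
multiline = json.dumps(records, indent=2)
parsed_whole = json.loads(multiline)
```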


In [16]:
spark.read.format("json").option("mode", "FAILFAST").option("inferSchema", "true").load("spark_data/flight-data/json/2010-summary.json").show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



* Writing JSON

In [17]:
csvFile.write.format("json").mode("overwrite").save("my-json.json")

In [18]:
!ls my-json.json/

part-00000-fda4046a-2921-466b-ab2c-4d684c93c746-c000.json  _SUCCESS


### Parquet
* Open-source, column-oriented data store that provides storage optimizations
* Provides columnar compression, which saves storage space and allows reading individual columns
* Spark's default file format.
![](images/parquet.png)

In [23]:
spark.read.format("parquet").load("spark_data/flight-data/parquet/2010-summary.parquet").show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



In [24]:
csvFile.write.format("parquet").mode("overwrite").save("my-parquet-file.parquet")

### ORC
* Self-describing, type-aware columnar file format for Hadoop. Optimized for large streaming reads, with support for finding required rows quickly.
* Parquet is optimized for Spark, whereas ORC is optimized for Hadoop and Hive.

In [26]:
spark.read.format("orc").load("spark_data/flight-data/orc/2010-summary.orc/").show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



In [27]:
csvFile.write.format("orc").mode("overwrite").save("my-orc.orc")

### Text file

* Each line in the file becomes a record in the DataFrame.

* We can control the parallelism of the files we write by controlling the partitions prior to writing.
* Splittable files
    - Certain file formats are fundamentally splittable, which lets Spark avoid reading an entire file.
    - On HDFS, a splittable file can be optimized further when it is saved across multiple blocks.
* Not all compression schemes are splittable. Parquet with gzip compression is a recommended combination.
* Multiple executors cannot read from the same file at the same time, but they can read different files at the same time. When we read a folder with multiple files in it, each file becomes a partition in the DataFrame and is read by the available executors in parallel.

* The number of files written depends on the number of partitions the DataFrame has at the time we write the data.
* By default, one file is written per partition of the data.
* So the "filename" we specify is actually a folder that contains one file per partition.
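
As a rough plain-Python analogy (threads stand in for executors, and the helper names are hypothetical), writing one part file per partition and then reading each file with its own task looks like this:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_partitions(partitions, folder):
    """Write one part file per partition into `folder`."""
    os.makedirs(folder, exist_ok=True)
    for index, partition in enumerate(partitions):
        with open(os.path.join(folder, f"part-{index:05d}.txt"), "w") as handle:
            handle.write("\n".join(partition))


def read_file(path):
    """Read a single part file; one task handles one file."""
    with open(path) as handle:
        return handle.read().splitlines()


def read_folder(folder, workers=4):
    """Read every part file in parallel; each file becomes one partition."""
    paths = sorted(os.path.join(folder, name) for name in os.listdir(folder))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_file, paths))


folder = os.path.join(tempfile.mkdtemp(), "multiple")
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
write_partitions(partitions, folder)
file_count = len(os.listdir(folder))   # one file per partition
read_back = read_folder(folder)        # one partition per file
```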

In [32]:
csvFile.repartition(5).write.format("csv").save("multiple.csv")

In [35]:
! ls multiple.csv/

part-00000-075f0b09-4e88-41dd-ba11-73ef7898213e-c000.csv
part-00001-075f0b09-4e88-41dd-ba11-73ef7898213e-c000.csv
part-00002-075f0b09-4e88-41dd-ba11-73ef7898213e-c000.csv
part-00003-075f0b09-4e88-41dd-ba11-73ef7898213e-c000.csv
part-00004-075f0b09-4e88-41dd-ba11-73ef7898213e-c000.csv
_SUCCESS


* Partitioning is a tool that allows us to control what data is stored and where.
* When we write data to a partitioned directory, we encode a column as a folder. This allows us to skip lots of data when we read it back later.

In [36]:
csvFile.limit(10).write.format("parquet").mode("overwrite").partitionBy("DEST_COUNTRY_NAME").save("partitioned.parquet")

In [37]:
!ls partitioned.parquet/

'DEST_COUNTRY_NAME=Costa Rica'	'DEST_COUNTRY_NAME=Senegal'
'DEST_COUNTRY_NAME=Egypt'	'DEST_COUNTRY_NAME=United States'
'DEST_COUNTRY_NAME=Moldova'	 _SUCCESS


* Each folder contains the Parquet files holding the rows for which that folder's predicate (for example `DEST_COUNTRY_NAME=Senegal`) is true.

In [38]:
!ls partitioned.parquet/DEST_COUNTRY_NAME\=Senegal

part-00000-c726aa7a-72a2-4d5f-901c-4184a3f70f10.c000.snappy.parquet


* Columns that we filter on frequently, such as a date, are good candidates for partitioning.
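
Why a partitioned layout lets a reader skip data can be sketched without Spark. The example below writes small CSV files into `column=value` folders and then reads a single folder. The helper names and the fixed `part-00000.csv` file name are assumptions for illustration only.

```python
import csv
import os
import tempfile
from collections import defaultdict

rows = [
    ("United States", "Romania", 1),
    ("United States", "Ireland", 264),
    ("Egypt", "United States", 24),
]


def write_partitioned(rows, root):
    """Group rows by destination and write each group under DEST_COUNTRY_NAME=<value>/."""
    groups = defaultdict(list)
    for dest, origin, count in rows:
        # The partition column is stored in the folder name, not in the file.
        groups[dest].append((origin, count))
    for dest, group in groups.items():
        folder = os.path.join(root, f"DEST_COUNTRY_NAME={dest}")
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, "part-00000.csv"), "w", newline="") as handle:
            csv.writer(handle).writerows(group)


def read_partition(root, dest):
    """Read only the folder for one destination; other folders are never opened."""
    folder = os.path.join(root, f"DEST_COUNTRY_NAME={dest}")
    with open(os.path.join(folder, "part-00000.csv"), newline="") as handle:
        return [(origin, int(count)) for origin, count in csv.reader(handle)]


root = tempfile.mkdtemp()
write_partitioned(rows, root)
folders = sorted(os.listdir(root))
egypt_rows = read_partition(root, "Egypt")
```

A filter on the partition column turns into a choice of folder, so the rows for every other destination are never read from disk.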


### Bucketing
* Bucketing lets us control the data that is written to each file.
* This can avoid shuffles later, because data with the same bucket ID is all grouped together into one physical partition.

* When we partition on one specific column, we might write out a very large number of directories. Bucketing instead creates a fixed number of files and organizes our data into those buckets.
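
The idea of assigning rows to a fixed number of buckets can be sketched as hash-then-modulo. The code below uses CRC32 purely as a stable stand-in; Spark uses a different hash function internally, and `bucket_id` and `bucketize` are hypothetical helpers.

```python
import zlib
from collections import defaultdict


def bucket_id(value, num_buckets):
    """Map a value to a bucket with a stable hash (CRC32 here, as an assumption)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def bucketize(rows, key_index, num_buckets):
    """Group rows so that equal key values always land in the same bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[key_index], num_buckets)].append(row)
    return buckets


rows = [
    ("United States", 1),
    ("Egypt", 24),
    ("Senegal", 24),
    ("Moldova", 1),
    ("Costa Rica", 264),
]
buckets = bucketize(rows, key_index=1, num_buckets=10)
```

However many distinct values the column holds, the number of buckets never exceeds `num_buckets`, which is what keeps the file count bounded compared with partitioning on the same column.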

In [41]:
numberBuckets = 10
columnToBucketBy = "count"

In [42]:
csvFile.write.format("parquet").mode("overwrite").bucketBy(numberBuckets, columnToBucketBy).saveAsTable("bucketedFile")

In [43]:
! ls bucketedFile

ls: cannot access 'bucketedFile': No such file or directory

* `saveAsTable` writes a managed table into the Spark warehouse directory (by default `spark-warehouse/` under the working directory) rather than to a path we choose, which is why `ls bucketedFile` finds nothing here.


* Writing lots of small files creates a lot of metadata overhead. Having very large files forces us to read an entire block of data when we want only a few rows.
* `maxRecordsPerFile` allows us to control file size by capping the number of records written to each file: `df.write.option("maxRecordsPerFile", 5000)`
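
The effect of a per-file record cap can be illustrated with simple chunking. This sketch models a single partition being written and assumes a positive limit; `split_into_files` is a hypothetical helper, not part of Spark.

```python
def split_into_files(records, max_records_per_file):
    """Yield consecutive chunks, each holding at most max_records_per_file records."""
    if max_records_per_file <= 0:
        raise ValueError("this sketch only handles a positive limit")
    for start in range(0, len(records), max_records_per_file):
        yield records[start:start + max_records_per_file]


records = list(range(12000))
chunks = list(split_into_files(records, 5000))
chunk_sizes = [len(chunk) for chunk in chunks]
```

With 12,000 records and a cap of 5,000, the partition is split into three files, the last one only partly full.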