# 2b. Data Sources

Spark SQL supports operations on a variety of data sources through the DataFrame interface. A DataFrame can be operated on with relational transformations or registered as a temporary view, which allows SQL queries to be run over its data.

This section describes the general methods for loading and saving data using Spark data sources, followed by the options available for the built-in data sources.

## Generic Load/Save Functions

The default data source (`parquet`, unless changed via `spark.sql.sources.default`) is used for all operations when no format is specified

* `spark.read.load(<source>)` -- load data 
* `df.select(<cols>).write.save(<path>)` -- save data

__Apache parquet__

A file format designed for fast processing of complex data. It generally offers better compression and read performance than row-oriented formats such as CSV or JSON, which makes it useful for large amounts of data

Features:

* Columnar - values are stored column by column instead of row by row
* Open source - free to use and developed as an Apache project within the Hadoop ecosystem
* Self-describing - each file embeds metadata such as its schema and structure, so records can be read without an external definition
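
To make the "columnar" point concrete, the following plain-Python sketch contrasts a row-oriented layout with a column-oriented one. It only illustrates the idea and is not how Parquet actually encodes data on disk; the sample records are made up for the example.

```python
# Row-oriented layout: each record keeps all of its fields together
rows = [
    {"name": "Alyssa", "favorite_color": None},
    {"name": "Ben", "favorite_color": "red"},
]

# Column-oriented layout: all values of one column are kept together
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Reading a single column touches one list instead of every record
names = columns["name"]
print(names)
```

Because values of the same column sit next to each other, a query that selects only a few columns can skip the rest, and similar values compress well.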

In [5]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data sources").getOrCreate()

# Load data
df = spark.read.load("users.parquet")
df.show()

# Save data
# df.select('name', 'favorite_color').write.save('saved_data/namesAndFavColour.parquet')


+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+



### Manually Specifying Options

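Extra options can be passed to the data source manually. Data sources are specified by their fully qualified name, but built-in sources can also use their short names (e.g. json, parquet, orc, libsvm, csv, ...). A DataFrame loaded from any data source type can be saved as another type using this syntax.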

In [6]:
# Loading json file
df = spark.read.load("people.json", format='json')

# Saving in parquet format
# df.select('name', 'age').write.save("nameAndAges.parquet", format='parquet')

# Loading csv
df_csv = spark.read.load('people.csv', format='csv', sep=';', inferSchema='true', header='true')
df_csv.show()

+-----+---+---------+
| name|age|      job|
+-----+---+---------+
|Jorge| 30|Developer|
|  Bob| 32|Developer|
+-----+---+---------+



Extra options can also be passed in write operations. The example below creates a bloom filter and uses dictionary encoding only for the `favorite_color` column of an ORC data source.

In [19]:
# Read the data
df = spark.read.orc('users.orc')
(df.write.format('orc')
	.option('orc.bloom.filter.columns', 'favorite_color')
	.option('orc.dictionary.key.threshold', '1.0')
	.option('orc.column.encoding.direct', 'name')
	.save('users_with_options.orc'))

### Run SQL on files directly

Instead of using the read API to load a file, you can run SQL queries on it directly by giving the format name followed by the file path wrapped in backticks (\`\`)

In [17]:
df = spark.sql("SELECT * FROM parquet.`users.parquet`")
df.show()

+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+



### Save Modes

Optional parameter that specifies how to handle existing data if present. Save modes do not utilize any locking and are not atomic. When performing an `overwrite`, the existing data is deleted before the new data is written out.

`df.select(<cols>).write.format(<format>).mode(<mode>).save(<path>)`

* `error` or `errorifexists` (default) -- throw an exception if data already exists at the target
* `append` -- append the contents of the DataFrame to the existing data
* `overwrite` -- if data already exists, replace it with the contents of the DataFrame
* `ignore` -- if data already exists, skip the save and leave the existing data unchanged

---

## Generic File Source Options

These options apply only to file-based sources: _parquet, orc, avro, json, csv, text_

### Ignoring Corrupt/Missing Files

With the options below enabled, Spark jobs continue to run when they encounter corrupt or missing files, and the contents that have already been read are still returned

* `spark.sql.files.ignoreCorruptFiles` - ignore corrupt files
* `spark.sql.files.ignoreMissingFiles` - ignore missing files (a missing file here means one deleted from the directory after the DataFrame was constructed)

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data sources").getOrCreate()

In [4]:
# Enable ignore corrupt files
spark.sql('set spark.sql.files.ignoreCorruptFiles=true')

# json file will be ignored 
test_corrupt_df = spark.read.parquet(
	"dir1/",
	"dir1/dir2"
)
test_corrupt_df.show()

+-------------+
|         file|
+-------------+
|file1.parquet|
|file2.parquet|
+-------------+



### Path Glob Filter

Only includes files whose names match a glob pattern. It does not change the behaviour of partition discovery

It is passed as the `pathGlobFilter` option, for example as a keyword argument to the `.load()` function

In [5]:
df = spark.read.load('dir1', format='parquet', pathGlobFilter='*.parquet')
df.show()

+-------------+
|         file|
+-------------+
|file1.parquet|
+-------------+



### Recursive Lookup

Recursively loads files from the given directory and disables partition inference. Defaults to `false`. If the data source explicitly specifies the `partitionSpec` while `recursiveFileLookup` is true, an exception is thrown

In [7]:
recursive_loaded_df = spark.read.format('parquet') \
	.option('recursiveFileLookup', 'true') \
	.load('dir1')
recursive_loaded_df.show()

+-------------+
|         file|
+-------------+
|file1.parquet|
|file2.parquet|
+-------------+



### Modification Time Path Filters

Options to achieve greater granularity over which files may load during a Spark batch query.

* `modifiedBefore`: an optional timestamp to include only files with a modification time before the specified time
* `modifiedAfter`: an optional timestamp to include only files with a modification time after the specified time

Timestamp format: __YYYY-MM-DDTHH:mm:ss__ (e.g. 2022-09-19T12:00:14)

When the `timeZone` option is not specified, timestamps are interpreted according to the Spark session timezone (`spark.sql.session.timeZone`)

Note: Structured Streaming file sources don't support these options

In [13]:
# modifiedBefore a specific date
df = spark.read.load('dir1', format='parquet', modifiedBefore='2050-07-01T08:30:00')
df.show()

# modifiedAfter a specific date
df = spark.read.load('dir1', format='parquet', modifiedAfter='2050-06-01T08:30:00')
df.show() # empty return

+-------------+
|         file|
+-------------+
|file1.parquet|
+-------------+



---

## Parquet Files

Parquet is a columnar file format widely used for big data workloads. Spark SQL supports both reading and writing Parquet files, automatically preserving the schema of the original data. When reading Parquet files, all columns are automatically converted to nullable for compatibility reasons.

### Loading Data Programmatically

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet files").getOrCreate()

df_people = spark.read.json('people.json')

# DataFrames saved as parquet files
df_people.write.parquet('people.parquet')

# Read in the newly created parquet file
parquet_people = spark.read.parquet('people.parquet')

# parquet files can also be used to create a temporary view for SQL queries
parquet_people.createOrReplaceTempView('parquet_people')
teenagers = spark.sql("SELECT name FROM parquet_people WHERE age >= 13 AND age <= 19")
teenagers.show()

### Partition Discovery

Table partitioning is a common optimization approach used in systems. In a partitioned table, data is usually stored in different directories, with partitioning column values encoded in the path of each directory. All built-in file sources are able to discover and infer partitioning information automatically.

__Features__

* Data types of the partitioning columns are automatically inferred
	- Inference can be disabled by setting `spark.sql.sources.partitionColumnTypeInference.enabled` to `false`, in which case partitioning columns are read as `string`

### Schema Merging

Columns can be added to a schema gradually as needed, so users may end up with multiple _parquet_ files with different but mutually compatible schemas. The Parquet data source can automatically detect this case and merge the schemas of these files

Schema merging is a relatively expensive operation, so it is turned off by default. To enable it, use either of the following:

1. Set the data source option `mergeSchema` to true when reading Parquet files
2. Set the global SQL option `spark.sql.parquet.mergeSchema` to true

In [None]:
from pyspark.sql import Row

# Create DataFrame and store into partition directory
sc = spark.sparkContext

df_squares = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i**2)))
df_squares.write.parquet('saved_data/test_table/key=1')

# Create another DataFrame in a new partition directory
# add a new column and drop an existing column
df_cubes = spark.createDataFrame(sc.parallelize(range(6, 11)).map(lambda i: Row(single=i, triple=i**3)))
df_cubes.write.parquet('saved_data/test_table/key=2')

# Read the partitioned table with schema merging enabled
# (the option name is 'mergeSchema'; a misspelled name is silently ignored)
df_merged = spark.read \
	.option('mergeSchema', 'true') \
	.parquet('saved_data/test_table')
df_merged.printSchema()


### Hive metastore Parquet table conversion

When reading from Hive metastore Parquet tables, Spark SQL tries to use its own Parquet support instead of the Hive SerDe for better performance. This behaviour is controlled by `spark.sql.hive.convertMetastoreParquet`, which is turned on by default

__Schema Reconciliation__

Differences between Hive and Parquet from a table schema perspective:

1. Hive is case insensitive, while Parquet is not
2. Hive considers all columns nullable, while nullability in Parquet is significant

Rules:

1. Fields that have the same name in both schemas must have the same data type regardless of nullability. The reconciled field takes the data type of the Parquet side, so that nullability is respected
2. Reconciled schema contains only fields defined in Hive metastore schema
	* fields that only appear in parquet schema are dropped
	* fields that only appear in Hive metastore are added as nullable fields
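
The rules above can be expressed as a small plain-Python function. This is an illustrative sketch of the rules as stated, not Spark's actual implementation: the `(name, data_type, nullable)` tuple representation, the `reconcile` function and the sample schemas are all hypothetical, and keeping the Hive-side spelling of field names is an assumption of the sketch.

```python
def reconcile(hive_schema, parquet_schema):
    """Reconcile two schemas given as lists of (name, data_type, nullable)."""
    # Hive is case insensitive, so fields are matched on lower-cased names
    parquet_by_name = {name.lower(): (dtype, nullable)
                       for name, dtype, nullable in parquet_schema}

    reconciled = []
    # Only fields defined in the Hive metastore schema are kept,
    # so Parquet-only fields are dropped implicitly
    for name, hive_type, _ in hive_schema:
        match = parquet_by_name.get(name.lower())
        if match is None:
            # Field appears only in the Hive metastore: add it as nullable
            reconciled.append((name, hive_type, True))
            continue
        parquet_type, parquet_nullable = match
        if parquet_type != hive_type:
            raise ValueError(f"conflicting data types for field {name!r}")
        # Same name in both: take the Parquet side so nullability is respected
        reconciled.append((name, parquet_type, parquet_nullable))
    return reconciled


# Hypothetical schemas for illustration
hive = [("id", "int", True), ("name", "string", True), ("extra", "string", True)]
parquet = [("ID", "int", False), ("Name", "string", True), ("dropped", "double", True)]

result = reconcile(hive, parquet)
print(result)
```

In this example `id` becomes non-nullable because the Parquet side says so, `extra` is added as nullable, and `dropped` does not appear in the result.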

__Metadata Refreshing__

Spark SQL caches Parquet metadata for better performance, and this caching also applies to converted Hive metastore tables. Tables need to be refreshed manually if they are updated by Hive or other external tools

* `spark.catalog.refreshTable(<table>)` -- refresh table in existing SparkSession

### Columnar Encryption

