# 2b. Data Sources

Spark SQL supports operations on a variety of data sources through the DataFrame interface. A DataFrame can be operated on with relational transformations or registered as a temporary view, which allows SQL queries to be run over its data.

This section describes the general methods for loading and saving data using Spark Data Sources, followed by the options available for built-in data sources.

## Generic Load/Save Functions

In the simplest form, the default data source (`parquet`, unless changed through `spark.sql.sources.default`) is used for all operations.

* `spark.read.load(<source>)` -- load data 
* `df.select(<cols>).write.save(<path>)` -- save data

__Apache Parquet__

Columnar file format designed for efficient storage and fast analytical processing of complex (nested) data. Because values of the same column are stored together, it typically compresses better than row-oriented text formats such as CSV or JSON and lets a query read only the columns it needs, which makes it useful for large amounts of data.

Features:

* Columnar - data entries are stored in columns instead of rows
* Open source - free to use under the Apache License; an Apache project that originated in the Hadoop ecosystem
* Self-describing - each file embeds metadata such as the schema and structure, so readers need no external schema definition
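
To make the "columnar" idea concrete, here is a toy sketch in plain Python that pivots row records into per-column lists. It is only an illustration of the layout idea and assumes every record has the same keys; the helper name `to_columnar` is hypothetical, and real Parquet files additionally use row groups, encodings and compression.

```python
# Toy illustration only: this is NOT how Parquet is laid out on disk.
rows = [
    {"name": "Alyssa", "favorite_color": None, "favorite_numbers": [3, 9, 15, 20]},
    {"name": "Ben", "favorite_color": "red", "favorite_numbers": []},
]


def to_columnar(records):
    """Pivot a list of row dicts into a dict of column lists (hypothetical helper)."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columnar(rows)

# Row layout: reading one field still walks every full record.
names_from_rows = [record["name"] for record in rows]

# Columnar layout: the values of one field are already stored together.
names_from_columns = columns["name"]

print(names_from_columns)  # ['Alyssa', 'Ben']
```

Reading a single column from the columnar form touches only that column's list, which is the property that lets columnar readers skip data a query does not ask for.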

In [5]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data sources").getOrCreate()

# Load data
df = spark.read.load("users.parquet")
df.show()

# Save data
# df.select('name', 'favorite_color').write.save('saved_data/namesAndFavColour.parquet')


+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+



### Manually Specifying Options

You can manually specify the data source to use, along with any extra options to pass to it. Built-in sources can be referred to by short names (e.g. `json`, `parquet`, `orc`, `libsvm`, `csv`, ...), and a DataFrame loaded from one source type can be saved as another.

In [6]:
# Loading json file
df = spark.read.load("people.json", format='json')

# Saving in parquet format
# df.select('name', 'age').write.save("nameAndAges.parquet", format='parquet')

# Loading csv
df_csv = spark.read.load('people.csv', format='csv', sep=';', inferSchema='true', header='true')
df_csv.show()

+-----+---+---------+
| name|age|      job|
+-----+---+---------+
|Jorge| 30|Developer|
|  Bob| 32|Developer|
+-----+---+---------+



Extra options can also be passed to write operations. The example below creates a bloom filter and uses dictionary encoding only for the `favorite_color` column of an ORC data source.

In [19]:
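# Read the data
df = spark.read.orc('users.orc')

# Write it back out with ORC-specific options
(df.write.format('orc')
	# Build a bloom filter for favorite_color
	.option('orc.bloom.filter.columns', 'favorite_color')
	# Allow dictionary encoding regardless of the share of distinct values
	.option('orc.dictionary.key.threshold', '1.0')
	# Force direct (non-dictionary) encoding for name
	.option('orc.column.encoding.direct', 'name')
	.save('users_with_options.orc'))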

### Run SQL on files directly

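Instead of using the read API to load a file into a DataFrame and then querying it, you can query the file directly with SQL by writing the format name followed by the file path wrapped in backticks, e.g. ``parquet.`users.parquet` ``.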

In [17]:
df = spark.sql("SELECT * FROM parquet.`users.parquet`")
df.show()

+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+
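
The same pattern should carry over to other built-in formats. The sketch below builds the query string for an arbitrary format and path; `direct_file_query` is a hypothetical helper introduced here for illustration, not part of Spark, and it simply rejects paths containing a backtick rather than trying to escape them.

```python
def direct_file_query(fmt, path, columns="*"):
    """Build a Spark SQL query that reads `path` directly as format `fmt`.

    Hypothetical helper: it only assembles the string, it does not run it.
    """
    if "`" in path:
        raise ValueError("path must not contain a backtick")
    return f"SELECT {columns} FROM {fmt}.`{path}`"


query = direct_file_query("parquet", "users.parquet", "name, favorite_color")
print(query)  # SELECT name, favorite_color FROM parquet.`users.parquet`
```

With an active session, the string would be executed as `spark.sql(query).show()`.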



### Save Modes

Optional parameter that specifies how to handle existing data if present. Save modes do not use any locking and are not atomic. When performing an `overwrite`, the existing data is deleted before the new data is written out.

`df.select(<cols>).write.format(<format>).mode(<mode>).save(<path>)`

* `error` or `errorifexists` (default) -- throw an exception if data already exists
* `append` -- append the contents of the DataFrame to the existing data; the schemas are expected to match
* `overwrite` -- if data already exists, replace it with the contents of the DataFrame
* `ignore` -- if data already exists, do nothing and leave the existing data unchanged
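
A minimal sketch of how a save mode might be applied in practice. `save_with_mode` and `VALID_MODES` are hypothetical names introduced here; the only Spark calls used are `DataFrameWriter.format`, `.mode` and `.save`. The helper checks the mode string up front so that a typo fails before any write is attempted.

```python
VALID_MODES = {"error", "errorifexists", "append", "overwrite", "ignore"}


def save_with_mode(df, path, mode="errorifexists", fmt="parquet"):
    """Write `df` to `path` after validating the save mode (hypothetical helper)."""
    normalized = mode.lower()
    if normalized not in VALID_MODES:
        raise ValueError(
            f"unknown save mode {mode!r}; expected one of {sorted(VALID_MODES)}"
        )
    # DataFrameWriter.mode() decides what happens when data already exists at `path`.
    df.write.format(fmt).mode(normalized).save(path)
    return normalized
```

For example, `save_with_mode(df.select('name', 'favorite_color'), 'saved_data/namesAndFavColour.parquet', mode='overwrite')` would replace any existing output at that path.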

---

## Generic File Source Options

These options apply only to file-based sources: _parquet, orc, avro, json, csv, text_

### Ignoring Corrupt Files

When this setting is enabled, Spark jobs continue to run when they encounter corrupt files, and the contents that have already been read are still returned.

* `spark.sql.files.ignoreCorruptFiles` - set to `true` to ignore corrupt files
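
The behaviour can be pictured with a plain-Python analogue that skips files which fail to parse. This is only a toy sketch of the idea, not Spark's implementation: the function name `read_json_files` and its `ignore_corrupt` flag are hypothetical, and unlike Spark it drops a corrupt file entirely rather than keeping the part that was read.

```python
import json
import tempfile
from pathlib import Path


def read_json_files(paths, ignore_corrupt=False):
    """Parse each file as JSON; optionally skip files that fail to parse."""
    records, skipped = [], []
    for path in paths:
        try:
            records.append(json.loads(Path(path).read_text()))
        except json.JSONDecodeError:
            if not ignore_corrupt:
                raise  # default behaviour: one bad file fails the whole read
            skipped.append(Path(path).name)
    return records, skipped


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    bad = Path(tmp) / "bad.json"
    good.write_text('{"name": "Ben"}')
    bad.write_text('{"name": ')  # truncated, so it cannot be parsed
    records, skipped = read_json_files([good, bad], ignore_corrupt=True)

print(records, skipped)  # [{'name': 'Ben'}] ['bad.json']
```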

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data sources").getOrCreate()

In [20]:
# Enable ignore corrupt files
spark.sql('set spark.sql.files.ignoreCorruptFiles=true')

# The JSON file in these directories is corrupt from Parquet's point of view and will be skipped
test_corrupt_df = spark.read.parquet('dir1/', 'dir1/dir2')
test_corrupt_df.show()

Py4JJavaError: An error occurred while calling o216.parquet.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
	... (remaining stack frames omitted)

The cell above failed while listing the input directories, before any file was read. An `UnsatisfiedLinkError` on `NativeIO$Windows.access0` typically means that the Hadoop native libraries for Windows (`winutils.exe` and `hadoop.dll`) are missing or do not match the Hadoop version in use, so this output does not show the effect of `spark.sql.files.ignoreCorruptFiles`.
