# Spark Data Sources

This notebook shows how to use the Spark Data Sources API to read the following file formats:
 * Parquet
 * JSON
 * CSV
 * Avro
 * ORC
 * Image
 * Binary

A full list of data source methods is available in the [Databricks data sources documentation](https://docs.databricks.com/spark/latest/data-sources/index.html#id1).

## Define paths for the various data sources

In [1]:
parquet_file = "/home/karthik/SparkCourse/pyspark notebooks/data/flights/summary-data/parquet/2010-summary.parquet"
json_file = "/home/karthik/SparkCourse/pyspark notebooks/data/flights/summary-data/json/*"
csv_file = "/home/karthik/SparkCourse/pyspark notebooks/data/flights/summary-data/csv/*"
orc_file = "/home/karthik/SparkCourse/pyspark notebooks/data/flights/summary-data/orc/*"
avro_file = "/home/karthik/SparkCourse/pyspark notebooks/data/flights/summary-data/avro/*"
schema = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"
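
The paths above all share one base directory. A minimal sketch of deriving them from a single variable, assuming the same directory layout; `base_dir` and `source_path` are names introduced here for illustration and are not part of the original notebook:

```python
import posixpath

# Assumed base directory, copied from the paths defined above.
base_dir = "/home/karthik/SparkCourse/pyspark notebooks/data/flights/summary-data"

def source_path(fmt, pattern="*"):
    """Return the path (or glob pattern) for one file format under base_dir."""
    return posixpath.join(base_dir, fmt, pattern)

parquet_file = source_path("parquet", "2010-summary.parquet")
json_file = source_path("json")
csv_file = source_path("csv")
orc_file = source_path("orc")
avro_file = source_path("avro")
```

With this layout, moving the data only requires changing `base_dir`.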

## Creating a SparkSession

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.appName('Data Sources').getOrCreate()

## Parquet Data Source

In [3]:
df_parquet = spark.read.format("parquet").option("path",parquet_file).load()

The same data can also be read with the format-specific shorthand method, `spark.read.parquet`:

In [4]:
df2_parquet = spark.read.parquet(parquet_file)

In [5]:
df_parquet.show(5,False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
+-----------------+-------------------+-----+
only showing top 5 rows



In [6]:
df2_parquet.show(5,False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
+-----------------+-------------------+-----+
only showing top 5 rows



## Use SQL

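This creates a temporary view over the Parquet file. The view exists only for the current Spark session, and Spark reads the data in place rather than managing it (_unmanaged_).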

In [7]:
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW us_delays_flights_parquet
USING parquet
OPTIONS(path "/home/karthik/SparkCourse/pyspark notebooks/data/flights/summary-data/parquet/2010-summary.parquet"
)""")

DataFrame[]
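
The statement above hard-codes the same path that is already stored in `parquet_file`. As a sketch only, the SQL text could be generated from the Python variable instead. `temp_view_sql` is a hypothetical helper, the path in the example is a placeholder, and the helper does no escaping, so it should only be given trusted values:

```python
def temp_view_sql(view_name, fmt, path, **options):
    """Build a CREATE OR REPLACE TEMPORARY VIEW statement for a file source."""
    opts = {"path": path, **options}
    # Render each option as: key "value"
    rendered = ", ".join(f'{key} "{value}"' for key, value in opts.items())
    return (
        f"CREATE OR REPLACE TEMPORARY VIEW {view_name}\n"
        f"USING {fmt}\n"
        f"OPTIONS ({rendered})"
    )

# Placeholder path for illustration; in the notebook this would be parquet_file.
stmt = temp_view_sql("us_delays_flights_parquet", "parquet", "/data/2010-summary.parquet")
print(stmt)
```

In the notebook, the generated string would be passed to `spark.sql(...)`, for example `spark.sql(temp_view_sql("us_delays_flights_parquet", "parquet", parquet_file))`.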

In [8]:
spark.sql("""show views""").show(truncate=False)

+---------+-------------------------+-----------+
|namespace|viewName                 |isTemporary|
+---------+-------------------------+-----------+
|         |us_delays_flights_parquet|true       |
+---------+-------------------------+-----------+



Query the view with SQL.

The result should match the DataFrame read above.

In [9]:
spark.sql("""
SELECT * FROM us_delays_flights_parquet LIMIT 5""").show(truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
+-----------------+-------------------+-----+



## JSON Data Source

In [10]:
df_json = spark.read.format("json").option("path",json_file).load()

The same data can also be read with the format-specific shorthand method, `spark.read.json`:

In [11]:
df2_json = spark.read.json(json_file)

In [12]:
df_json.show(5,False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
+-----------------+-------------------+-----+
only showing top 5 rows



In [13]:
df2_json.show(5,False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
+-----------------+-------------------+-----+
only showing top 5 rows



## Use SQL

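This creates a temporary view over the JSON files. The view exists only for the current Spark session, and Spark reads the data in place rather than managing it (_unmanaged_).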

In [14]:
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW us_delays_flight_json
USING json
OPTIONS(path "/home/karthik/SparkCourse/pyspark notebooks/data/flights/summary-data/json/*")""")

DataFrame[]

In [15]:
spark.sql("""
SHOW VIEWS""").show(truncate=False)

+---------+-------------------------+-----------+
|namespace|viewName                 |isTemporary|
+---------+-------------------------+-----------+
|         |us_delays_flight_json    |true       |
|         |us_delays_flights_parquet|true       |
+---------+-------------------------+-----------+



Query the view with SQL.

The result should match the DataFrame read above.

In [16]:
spark.sql("""
SELECT * FROM us_delays_flight_json LIMIT 5
""").show(truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
+-----------------+-------------------+-----+



## CSV Data Source

In [17]:
df_csv = (spark.read.format("csv")
        .option("header","true")
          .schema(schema)
          .option("path",csv_file)
          .option("mode","FAILFAST")
          .option("nullValue","")
          .load())

The same data can also be read with the format-specific shorthand method, `spark.read.csv`, passing the options as keyword arguments:

In [18]:
df2_csv = spark.read.csv(csv_file,header=True,schema=schema,mode="FAILFAST",nullValue="")

In [19]:
df_csv.show(5,False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
+-----------------+-------------------+-----+
only showing top 5 rows



In [20]:
df2_csv.show(5,False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
+-----------------+-------------------+-----+
only showing top 5 rows



In [24]:
# The save mode is set with .mode(); .option("mode", ...) does not change it.
(df_csv.write.format("parquet")
 .mode("overwrite")
 .option("compression","snappy")
 .option("path","/home/karthik/SparkCourse/pyspark notebooks/data/flights/parquet/output")
 .save())

In [25]:
%ls -lrt "/home/karthik/SparkCourse/pyspark notebooks/data/flights/parquet/output"

total 12
-rw-r--r-- 1 karthik karthik 10453 May 22 17:48 part-00000-664d9513-4148-4aea-9fe2-b63f8280efd8-c000.snappy.parquet
-rw-r--r-- 1 karthik karthik     0 May 22 17:48 _SUCCESS


## Use SQL

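This creates a temporary view over the CSV files. The view exists only for the current Spark session, and Spark reads the data in place rather than managing it (_unmanaged_). Only the `header` option is set here, so every column is read as a string.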

In [26]:
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW us_delays_flights_csv
USING csv
OPTIONS(path "/home/karthik/SparkCourse/pyspark notebooks/data/flights/summary-data/csv/*",header "True")""")

DataFrame[]

In [27]:
spark.sql("""
SHOW VIEWS""").show(truncate=False)

+---------+-------------------------+-----------+
|namespace|viewName                 |isTemporary|
+---------+-------------------------+-----------+
|         |us_delays_flight_json    |true       |
|         |us_delays_flights_csv    |true       |
|         |us_delays_flights_parquet|true       |
+---------+-------------------------+-----------+



In [28]:
spark.sql("""
SELECT * FROM us_delays_flights_csv LIMIT 5""").show(truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
+-----------------+-------------------+-----+



## ORC Data Source

In [29]:
df_orc = (spark.read
         .format("orc")
         .option("path",orc_file)
         .load())

In [30]:
df_orc.show(5,False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
+-----------------+-------------------+-----+
only showing top 5 rows



## Use SQL

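This creates a temporary view over the ORC files. The view exists only for the current Spark session, and Spark reads the data in place rather than managing it (_unmanaged_).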

In [31]:
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW us_delays_flights_orc
USING orc
OPTIONS(path "/home/karthik/SparkCourse/pyspark notebooks/data/flights/summary-data/orc/*")""")

DataFrame[]

In [32]:
spark.sql("""
SHOW VIEWS""").show(truncate=False)

+---------+-------------------------+-----------+
|namespace|viewName                 |isTemporary|
+---------+-------------------------+-----------+
|         |us_delays_flight_json    |true       |
|         |us_delays_flights_csv    |true       |
|         |us_delays_flights_orc    |true       |
|         |us_delays_flights_parquet|true       |
+---------+-------------------------+-----------+



In [33]:
spark.sql("""
SELECT * FROM us_delays_flights_orc
""").show(5,False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
+-----------------+-------------------+-----+
only showing top 5 rows



## Avro Data Source

In [34]:
df_avro = spark.read.format("avro").option("path",avro_file).load()

AnalysisException: Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
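
The error above says that the Avro module is not bundled with Spark and has to be deployed separately. One common way is to set `spark.jars.packages` before the session starts. The sketch below is untested against this notebook's environment: the helper names are hypothetical, and the version `3.0.1` and Scala version `2.12` are assumptions that should be replaced with the values of the local installation (see `spark.version`).

```python
def spark_avro_package(spark_version, scala_version="2.12"):
    """Return the Maven coordinate of the spark-avro module for a Spark build."""
    return f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"

def build_session_with_avro(spark_version, scala_version="2.12"):
    """Create a SparkSession that fetches spark-avro at start-up.

    The package setting only takes effect if no SparkSession is running yet,
    so in a notebook the kernel has to be restarted first.
    """
    # Imported lazily so spark_avro_package() can be used without Spark installed.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName("Data Sources")
            .config("spark.jars.packages",
                    spark_avro_package(spark_version, scala_version))
            .getOrCreate())

# Assumed version, for illustration only.
print(spark_avro_package("3.0.1"))
```

After restarting the kernel and creating the session with `build_session_with_avro(...)`, the `spark.read.format("avro")` call above should be able to find the data source.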

## Image

In [37]:
from pyspark.ml import image

image_dir = "/home/karthik/SparkCourse/pyspark notebooks/data/cctvVideos/train_images/"
images_df = spark.read.format("image").load(image_dir)
images_df.printSchema()

images_df.select("image.height", "image.width", "image.nChannels", "image.mode", "label").show(5, truncate=False)

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |-- label: integer (nullable = true)



Py4JJavaError: An error occurred while calling o153.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 22, localhost, executor driver): java.io.FileNotFoundException: File file:/home/karthik/SparkCourse/pyspark%20notebooks/data/cctvVideos/train_images/label=0/LeftBagframe0004.jpg does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3627)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2697)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2697)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2904)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
	at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/home/karthik/SparkCourse/pyspark%20notebooks/data/cctvVideos/train_images/label=0/LeftBagframe0004.jpg does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
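
The path in the error message is percent-encoded (`pyspark%20notebooks`), while the directory on disk contains a literal space. The `binaryFile` read in the next section succeeds on the same directory, so one plausible reading, not confirmed by this notebook, is that the space in the path is involved in the failure. The standard-library sketch below only illustrates the mismatch between the two spellings:

```python
from urllib.parse import quote, unquote

# Directory used in the cell above; it contains a space.
image_dir = "/home/karthik/SparkCourse/pyspark notebooks/data/cctvVideos/train_images/"

encoded = quote(image_dir)    # the spelling that appears in the error message
decoded = unquote(encoded)    # the spelling that exists on disk

print(encoded)
print(decoded == image_dir)
```

If the space is indeed the cause, a possible workaround (an assumption, not tested here) is to move or symlink the data to a directory whose name has no spaces.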


## Binary

In [38]:
path = "/home/karthik/SparkCourse/pyspark notebooks/data/cctvVideos/train_images/"
binary_files_df = (spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.jpg")
  .load(path))

binary_files_df.show(5)

+--------------------+-------------------+------+--------------------+-----+
|                path|   modificationTime|length|             content|label|
+--------------------+-------------------+------+--------------------+-----+
|file:/home/karthi...|2021-05-22 16:54:27| 55037|[FF D8 FF E0 00 1...|    0|
|file:/home/karthi...|2021-05-22 16:54:28| 54634|[FF D8 FF E0 00 1...|    1|
|file:/home/karthi...|2021-05-22 16:54:27| 54624|[FF D8 FF E0 00 1...|    0|
|file:/home/karthi...|2021-05-22 16:54:27| 54505|[FF D8 FF E0 00 1...|    0|
|file:/home/karthi...|2021-05-22 16:54:27| 54475|[FF D8 FF E0 00 1...|    0|
+--------------------+-------------------+------+--------------------+-----+
only showing top 5 rows



In [39]:
binary_files_df = (spark.read.format("binaryFile")
   .option("pathGlobFilter", "*.jpg")
   .option("recursiveFileLookup", "true")
   .load(path))
binary_files_df.show(5)


+--------------------+-------------------+------+--------------------+
|                path|   modificationTime|length|             content|
+--------------------+-------------------+------+--------------------+
|file:/home/karthi...|2021-05-22 16:54:27| 55037|[FF D8 FF E0 00 1...|
|file:/home/karthi...|2021-05-22 16:54:28| 54634|[FF D8 FF E0 00 1...|
|file:/home/karthi...|2021-05-22 16:54:27| 54624|[FF D8 FF E0 00 1...|
|file:/home/karthi...|2021-05-22 16:54:27| 54505|[FF D8 FF E0 00 1...|
|file:/home/karthi...|2021-05-22 16:54:27| 54475|[FF D8 FF E0 00 1...|
+--------------------+-------------------+------+--------------------+
only showing top 5 rows
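
The second read has no `label` column because `recursiveFileLookup` disables partition inference, so directory names such as `label=0` are no longer turned into columns. A minimal sketch of recovering the value from the path string; `label_from_path` and the example path are hypothetical, and in Spark the same idea could be expressed with `regexp_extract` from `pyspark.sql.functions` applied to the `path` column:

```python
import re

def label_from_path(path):
    """Extract the integer after 'label=' in a partition-style path, or None."""
    match = re.search(r"label=(\d+)", path)
    return int(match.group(1)) if match else None

# Hypothetical path, following the label=<n> directory layout seen above.
example = "file:/data/train_images/label=0/LeftBagframe0004.jpg"
print(label_from_path(example))
```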

