- Title: Read/Write Parquet Files in Spark
- Slug: spark-io-parquet
- Date: 2019-11-26
- Category: Computer Science
- Tags: programming, Scala, Spark, Parquet
- Author: Ben Du
- Modified: 2019-11-26


[DataFrameReader](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader) 
APIs

[DataFrameWriter](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter)
APIs

https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources

In [1]:
%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1
org.apache.spark spark-hive_2.11 2.3.1

# Load Data

1. `.load` is a general method for reading data in different formats. 
    Specify the format via the method `.format`; 
    if `.format` is omitted, `.load` uses the default data source 
    (Parquet unless `spark.sql.sources.default` is changed).
    `.csv` (both for CSV and TSV), `.json` and `.parquet` are specializations of `.load`, 
    so `.format` is not needed when you use one of them.

2. When reading CSV, the first line is not treated as a header by default; 
    set `option("header", "true")` if the file has one.

3. Call `.coalesce(1)` or `.repartition(1)` before writing if you want only 1 output file. 
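
The points above can be sketched as follows. This is a minimal, self-contained illustration rather than part of the original notebook: the sample rows, the temporary directory, and the names `path`, `viaLoad`, `viaParquet`, `viaDefault` and `partFiles` are assumptions made for the example.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("LoadSketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data written to a temporary location (not a file from this post).
val path = new File(Files.createTempDirectory("load-sketch").toFile, "people.parquet").getPath

// coalesce(1) collapses the data to one partition, so a single part file is written.
Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age").coalesce(1).write.parquet(path)

// The general reader with an explicit format.
val viaLoad = spark.read.format("parquet").load(path)
// The Parquet-specific shortcut.
val viaParquet = spark.read.parquet(path)
// load without a format falls back to the default data source
// (Parquet unless spark.sql.sources.default has been changed).
val viaDefault = spark.read.load(path)

// Spark writes a directory; the data lives in the part-* files inside it.
val partFiles = new File(path).listFiles.filter { f =>
    f.getName.startsWith("part-") && f.getName.endsWith(".parquet")
}
```

All three readers should return the same two rows, and `partFiles` should hold exactly one file because the data was coalesced to a single partition before writing.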


## Load Data in Parquet Format

In [7]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local")
    .appName("IO")
    .getOrCreate()
spark

org.apache.spark.sql.SparkSession@5d88dad5

In [9]:
val df = spark.read.parquet("f2.parquet")
df.show

+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|cancelled|carrier|tailnum|flight|origin|dest|air_time|distance|hour|min|
+----+-----+---+--------+---------+--------+---------+---------+-------+-------+------+------+----+--------+--------+----+---+
|2014|    1|  1|     914|       14|    1238|       13|        0|     AA| N338AA|     1|   JFK| LAX|     359|    2475|   9| 14|
|2014|    1|  1|    1157|       -3|    1523|       13|        0|     AA| N335AA|     3|   JFK| LAX|     363|    2475|  11| 57|
|2014|    1|  1|    1902|        2|    2224|        9|        0|     AA| N327AA|    21|   JFK| LAX|     351|    2475|  19|  2|
|2014|    1|  1|     722|       -8|    1014|      -26|        0|     AA| N3EHAA|    29|   LGA| PBI|     157|    1035|   7| 22|
|2014|    1|  1|    1347|        2|    1706|        1|        0|     AA| N319AA|   117|   JFK| LAX|     350|   

null

In [10]:
df.count

253316

In [11]:
df.select(input_file_name()).show

+--------------------+
|   input_file_name()|
+--------------------+
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
|file:///workdir/l...|
+--------------------+
only showing top 20 rows



In [1]:
val df = spark.read.load("namesAndAges.parquet")
df.show

<console>: 89

In [9]:
val df = spark.sql("SELECT * FROM parquet.`namesAndAges.parquet`")
df.show

+-------+----+
|   name| age|
+-------+----+
|Michael|null|
|   Andy|  30|
| Justin|  19|
+-------+----+



In [20]:
import java.io.File

new File(".").listFiles.filter(_.getPath.endsWith(".csv"))

Array(./flights14.csv, ./f2.csv)

## Write DataFrame to Parquet

In [32]:
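// header=true: the first line holds the column names.
// mode=DROPMALFORMED: rows that fail to parse are skipped.
// .csv(...) already implies the CSV format, so no .format("csv") is needed.
val flights = spark.read.
    option("header", "true").
    option("mode", "DROPMALFORMED").
    csv("flights14.csv")
flights.write.parquet("f2.parquet")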
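
Running the cell above a second time fails if `f2.parquet` already exists, because the default save mode raises an error when the target path is present. Below is a minimal sketch of replacing existing output with `mode("overwrite")`. The rows, the temporary path, and the names `out`, `first`, `second` and `reread` are made up for illustration.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("WriteModeSketch").getOrCreate()
import spark.implicits._

// Hypothetical output location inside a fresh temporary directory.
val out = new File(Files.createTempDirectory("write-sketch").toFile, "flights.parquet").getPath

// Hypothetical stand-ins for the flights data.
val first = Seq(("AA", 1), ("AA", 3)).toDF("carrier", "flight")
val second = Seq(("AA", 21)).toDF("carrier", "flight")

first.write.parquet(out)
// A second plain write to the same path would throw; overwrite replaces the old files.
second.write.mode("overwrite").parquet(out)

val reread = spark.read.parquet(out)
```

After the second write, `reread` contains only the rows of `second`. Use `mode("append")` instead if the new rows should be added to the existing ones.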

In [3]:
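// peopleDF is assumed to be an existing DataFrame with name and age columns;
// it is not defined in this notebook.
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")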
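
Since `peopleDF` is not defined in this post, here is a minimal sketch of one way to build a matching DataFrame, write it, and query the files directly with SQL. The rows mirror the `namesAndAges.parquet` output shown earlier, while the construction of `peopleDF`, the temporary path `target`, and the names `fromSql` and `names` are assumptions.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("PeopleSketch").getOrCreate()
import spark.implicits._

// Hypothetical construction of peopleDF; Option[Int] gives a nullable age column.
val peopleDF = Seq[(String, Option[Int])](
    ("Michael", None),
    ("Andy", Some(30)),
    ("Justin", Some(19))
).toDF("name", "age")

// Hypothetical output location inside a fresh temporary directory.
val target = new File(Files.createTempDirectory("people-sketch").toFile, "namesAndAges.parquet").getPath
peopleDF.select("name", "age").write.format("parquet").save(target)

// Query the Parquet files in place, without registering a table first.
val fromSql = spark.sql(s"SELECT name, age FROM parquet.`$target` WHERE age IS NOT NULL")
val names = fromSql.collect().map(_.getString(0)).sorted
```

Here `names` holds the people whose age is known, and reading `target` back with `spark.read.parquet` returns all three rows, including the one with a null age.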

## References

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html

https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html