In [12]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('readJson').getOrCreate()

## Read JSON file into DataFrame
Use `read.json("path")` or `read.format("json").load("path")` to read a JSON file into a PySpark DataFrame. Both methods take a file path as an argument.

In [13]:
# Read JSON file into dataframe
df = spark.read.json("./resources/json_files/zipcodes.json")
df.printSchema()
df.show(5)

root
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Decommisioned: boolean (nullable = true)
 |-- EstimatedPopulation: long (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- LocationText: string (nullable = true)
 |-- LocationType: string (nullable = true)
 |-- Long: double (nullable = true)
 |-- Notes: string (nullable = true)
 |-- RecordNumber: long (nullable = true)
 |-- State: string (nullable = true)
 |-- TaxReturnsFiled: long (nullable = true)
 |-- TotalWages: long (nullable = true)
 |-- WorldRegion: string (nullable = true)
 |-- Xaxis: double (nullable = true)
 |-- Yaxis: double (nullable = true)
 |-- Zaxis: double (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- Zipcode: long (nullable = true)

+-------------------+-------+-------------+-------------------+-----+--------------------+--------------------+--------------+------+-----+------------+-----+---------------+----------+-----

In [14]:
# Equivalent read using format().load(); the short name "json" also selects this data source
df1 = spark.read.format('org.apache.spark.sql.json').load("./resources/json_files/zipcodes.json")
df1.show(5)

+-------------------+-------+-------------+-------------------+-----+--------------------+--------------------+--------------+------+-----+------------+-----+---------------+----------+-----------+-----+-----+-----+-----------+-------+
|               City|Country|Decommisioned|EstimatedPopulation|  Lat|            Location|        LocationText|  LocationType|  Long|Notes|RecordNumber|State|TaxReturnsFiled|TotalWages|WorldRegion|Xaxis|Yaxis|Zaxis|ZipCodeType|Zipcode|
+-------------------+-------+-------------+-------------------+-----+--------------------+--------------------+--------------+------+-----+------------+-----+---------------+----------+-----------+-----+-----+-----+-----------+-------+
|        PARC PARQUE|     US|        false|               NULL|17.96|NA-US-PR-PARC PARQUE|     Parc Parque, PR|NOT ACCEPTABLE|-66.22| NULL|           1|   PR|           NULL|      NULL|         NA| 0.38|-0.87|  0.3|   STANDARD|    704|
|PASEO COSTA DEL SUR|     US|        false|             

## Read JSON file from multiline
The PySpark JSON data source accepts several read options. By default it expects one JSON record per line; set the `multiline` option to `true` to read records that span multiple lines. The option defaults to `false`.

Use `read.option("multiline","true")`:

In [15]:
# Read multiline json file
multiline_df = spark.read.option("multiline","true") \
      .json("./resources/json_files/multiline-zipcode.json")
multiline_df.show() 

+-------------------+------------+-----+-----------+-------+
|               City|RecordNumber|State|ZipCodeType|Zipcode|
+-------------------+------------+-----+-----------+-------+
|PASEO COSTA DEL SUR|           2|   PR|   STANDARD|    704|
|       BDA SAN LUIS|          10|   PR|   STANDARD|    709|
+-------------------+------------+-----+-----------+-------+
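
To make the difference concrete, here is a minimal sketch in plain Python (the standard `json` module, not Spark's own parser) that mimics reading a file one line at a time. The sample records and the helper `parse_line_by_line` are hypothetical and exist only for this illustration. Line-delimited JSON parses cleanly line by line, while a pretty-printed document only parses when it is read as a whole, which is roughly what the `multiline` option asks Spark to do.

```python
import json

# Hypothetical sample records, loosely modelled on the zipcode data above
records = [
    {"City": "PASEO COSTA DEL SUR", "Zipcode": 704},
    {"City": "BDA SAN LUIS", "Zipcode": 709},
]

# One complete JSON object per line (line-delimited JSON)
json_lines_text = "\n".join(json.dumps(r) for r in records)

# A single JSON document spread over many lines
multiline_text = json.dumps(records, indent=2)


def parse_line_by_line(text):
    """Parse every non-empty line as its own JSON value and collect failures."""
    parsed, failed = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            failed.append(line)
    return parsed, failed


# Line-delimited input: every line is a valid record
ok_rows, ok_failures = parse_line_by_line(json_lines_text)

# Pretty-printed input: no single line is valid JSON on its own
bad_rows, bad_failures = parse_line_by_line(multiline_text)

# Reading the whole text as one document works
whole_document = json.loads(multiline_text)

print(len(ok_rows), len(ok_failures))
print(len(bad_rows), len(bad_failures))
print(len(whole_document))
```

In Spark, lines that cannot be parsed in the default mode typically end up in a `_corrupt_record` column instead of raising an error.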



## Reading multiple files at a time
The `read.json()` method can also read multiple JSON files from different paths in a single call. Pass the file paths as a list, for example:

In [16]:
# Read multiple files
df2 = spark.read.json(
    ['./resources/json_files/n1.json','./resources/json_files/n2.json'])
df2.show() 

+--------------------+-----------+------------------+--------------------+---------------+
|     _corrupt_record|     action|          customer|              device|           time|
+--------------------+-----------+------------------+--------------------+---------------+
|[{"time":"7:15:38...|       NULL|              NULL|                NULL|           NULL|
|                NULL|   power on| Pietro MacMenamie|         Amazon Echo|10:17:17.000 AM|
|                NULL|   power on|  Diego Caudrelier|  GreenIQ Controller|10:19:17.000 AM|
|                NULL|low battery|       Ben Humpage|         Amazon Echo| 7:07:41.000 AM|
|                NULL|low battery|    Stillman Tatum|Nest T3021US Ther...| 2:07:18.000 PM|
|                NULL|   power on|   Carolann Fernez|         Amazon Echo| 7:33:40.000 AM|
|                NULL|  power off|      Gran Torbeck| August Doorbell Cam|11:30:10.000 AM|
|                NULL|low battery|Therine Jakubowski|         Amazon Echo| 4:52:14.000 PM|

## Reading all files in a directory
To read every JSON file in a directory into a DataFrame, pass the directory path to the `json()` method. A glob pattern such as `n*.json`, used below, reads only the files whose names match it.

In [21]:
# Read all JSON files in the folder whose names match n*.json
df3 = spark.read.json("./resources/json_files/n*.json")
df3.show()

25/08/09 17:27:10 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: ./resources/json_files/n*.json.
java.io.FileNotFoundException: File resources/json_files/n*.json does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:917)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1238)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:907)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:56)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:381)
	at org.apache.spark.sql.catalyst.analysis.ResolveDataSource.org$apache$spark$sql$catalyst$analysis$ResolveDataSource$$loadV1BatchSource(ResolveDataSource.scala:143)
	at org.apache.spark.sql.catalyst.an

+--------------------+-----------+------------------+--------------------+---------------+
|     _corrupt_record|     action|          customer|              device|           time|
+--------------------+-----------+------------------+--------------------+---------------+
|[{"time":"7:15:38...|       NULL|              NULL|                NULL|           NULL|
|                NULL|   power on| Pietro MacMenamie|         Amazon Echo|10:17:17.000 AM|
|                NULL|   power on|  Diego Caudrelier|  GreenIQ Controller|10:19:17.000 AM|
|                NULL|low battery|       Ben Humpage|         Amazon Echo| 7:07:41.000 AM|
|                NULL|low battery|    Stillman Tatum|Nest T3021US Ther...| 2:07:18.000 PM|
|                NULL|   power on|   Carolann Fernez|         Amazon Echo| 7:33:40.000 AM|
|                NULL|  power off|      Gran Torbeck| August Doorbell Cam|11:30:10.000 AM|
|                NULL|low battery|Therine Jakubowski|         Amazon Echo| 4:52:14.000 PM|

## Reading files with a user-specified custom schema
A PySpark schema defines the structure of a DataFrame: its column names and data types. PySpark SQL provides the `StructType` and `StructField` classes to specify that structure programmatically.

If you know the schema of the file ahead of time and do not want to rely on the default schema inference, use the `schema()` method to supply your own column names and data types.

In [18]:
from pyspark.sql.types import IntegerType, BooleanType, DoubleType, StringType, StructType, StructField


# Define custom schema
schema = StructType([
      StructField("RecordNumber",IntegerType(),True),
      StructField("Zipcode",IntegerType(),True),
      StructField("ZipCodeType",StringType(),True),
      StructField("City",StringType(),True),
      StructField("State",StringType(),True),
      StructField("LocationType",StringType(),True),
      StructField("Lat",DoubleType(),True),
      StructField("Long",DoubleType(),True),
      StructField("Xaxis",DoubleType(),True),
      StructField("Yaxis",DoubleType(),True),
      StructField("Zaxis",DoubleType(),True),
      StructField("WorldRegion",StringType(),True),
      StructField("Country",StringType(),True),
      StructField("LocationText",StringType(),True),
      StructField("Location",StringType(),True),
      StructField("Decommisioned",BooleanType(),True),
      StructField("TaxReturnsFiled",IntegerType(),True),
      StructField("EstimatedPopulation",IntegerType(),True),
      StructField("TotalWages",IntegerType(),True),
      StructField("Notes",StringType(),True)
  ])

df_with_schema = spark.read.schema(schema).json("./resources/json_files/zipcodes.json")
# df_with_schema = spark.read.json("./resources/json_files/zipcodes.json")
df_with_schema.printSchema()
df_with_schema.show()

root
 |-- RecordNumber: integer (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- LocationType: string (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Long: double (nullable = true)
 |-- Xaxis: double (nullable = true)
 |-- Yaxis: double (nullable = true)
 |-- Zaxis: double (nullable = true)
 |-- WorldRegion: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- LocationText: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Decommisioned: boolean (nullable = true)
 |-- TaxReturnsFiled: integer (nullable = true)
 |-- EstimatedPopulation: integer (nullable = true)
 |-- TotalWages: integer (nullable = true)
 |-- Notes: string (nullable = true)

+------------+-------+-----------+-------------------+-----+--------------+-----+-------+-----+-----+-----+-----------+-------+--------------------+--------------------+------

## Read JSON file using PySpark SQL
PySpark SQL can also read a JSON file by creating a temporary view directly over the file with `spark.sql()` and a `CREATE TEMPORARY VIEW ... USING json` statement. The view can then be queried like a table.

In [19]:
# Create (or replace) a temporary view backed by the JSON file
spark.sql("CREATE OR REPLACE TEMPORARY VIEW zipcode USING json OPTIONS (path './resources/json_files/zipcodes.json')")
# Query the view like any other table
spark.sql("select * from zipcode").show()

+-------------------+-------+-------------+-------------------+-----+--------------------+--------------------+--------------+-------+-------------+------------+-----+---------------+----------+-----------+-----+-----+-----+-----------+-------+
|               City|Country|Decommisioned|EstimatedPopulation|  Lat|            Location|        LocationText|  LocationType|   Long|        Notes|RecordNumber|State|TaxReturnsFiled|TotalWages|WorldRegion|Xaxis|Yaxis|Zaxis|ZipCodeType|Zipcode|
+-------------------+-------+-------------+-------------------+-----+--------------------+--------------------+--------------+-------+-------------+------------+-----+---------------+----------+-----------+-----+-----+-----+-----------+-------+
|        PARC PARQUE|     US|        false|               NULL|17.96|NA-US-PR-PARC PARQUE|     Parc Parque, PR|NOT ACCEPTABLE| -66.22|         NULL|           1|   PR|           NULL|      NULL|         NA| 0.38|-0.87|  0.3|   STANDARD|    704|
|PASEO COSTA DEL SUR

## PySpark Saving modes
The PySpark `DataFrameWriter` has a `mode()` method to specify the save mode. It accepts `overwrite`, `append`, `ignore`, or `errorifexists`.

* overwrite – replaces any data that already exists at the output path
* append – adds the new data to what already exists at the output path
* ignore – skips the write when the output path already exists
* errorifexists or error – the default; raises an error when the output path already exists

In [20]:
# Read multiline json file
multiline_df = spark.read.option("multiline","true") \
      .json("./resources/json_files/multiline-zipcode.json")

multiline_df.show() 

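# Write the DataFrame as JSON, replacing any existing output at this path
# (Spark creates a directory of part files with this name)
multiline_df.write.mode('overwrite').json("./resources/json_files/writenFile.json")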


+-------------------+------------+-----+-----------+-------+
|               City|RecordNumber|State|ZipCodeType|Zipcode|
+-------------------+------------+-----+-----------+-------+
|PASEO COSTA DEL SUR|           2|   PR|   STANDARD|    704|
|       BDA SAN LUIS|          10|   PR|   STANDARD|    709|
+-------------------+------------+-----+-----------+-------+

