## Sample tutorial: reading a JSON file with Apache Spark
With a SparkSession or sqlContext, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.

As an example, the following creates a DataFrame based on the content of a JSON file:

In [1]:
testJsonData = sqlContext.read.json("/data/year=2017/month=7/day=4/hour=14/dump.json")
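
By default, `read.json` expects line-delimited JSON: each line of the file is one self-contained JSON object, not a single pretty-printed document (newer Spark releases also offer a `multiLine` reader option). The sketch below uses only the Python standard library to show that layout. The records and the helper names `write_json_lines` and `read_json_lines` are made up for illustration; only the column names come from this tutorial.

```python
import json
import os
import tempfile

# Hypothetical records; the field names mirror the columns in this tutorial.
records = [
    {"host_ip": "my_ipv6", "rawdata": "example-payload-1", "src": "ESC", "timestamp": 1499170050873},
    {"host_ip": "my_ipv6", "rawdata": "example-payload-2", "src": "ESC", "timestamp": 1499170051727},
]


def write_json_lines(path, rows):
    """Write one compact JSON object per line (the line-delimited layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_json_lines(path):
    """Parse every non-empty line as an independent JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "dump.json")
write_json_lines(path, records)
rows = read_json_lines(path)
print(len(rows), sorted(rows[0]))
```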

In [2]:
# print the schema
testJsonData.printSchema()

root
 |-- host_ip: string (nullable = true)
 |-- rawdata: string (nullable = true)
 |-- src: string (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- hour: integer (nullable = true)
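
The `year`, `month`, `day` and `hour` columns in this schema match the `key=value` directories in the input path, which is how Spark's partition discovery represents partition columns. The following is a simplified pure-Python illustration of that mapping, not Spark's actual implementation; the function name `partition_values` is hypothetical. Note that in Spark 1.6 and later, partition columns are only derived from directories below the path passed to the reader unless the `basePath` option is set, so the result may depend on the Spark version and options in use.

```python
def partition_values(path):
    """Return {column: value} for every `key=value` directory in `path`."""
    values = {}
    # The last path component is the file name, so it is skipped.
    for segment in path.strip("/").split("/")[:-1]:
        key, sep, raw = segment.partition("=")
        if sep and key and raw:
            # Simplification: only non-negative integers are type-inferred here.
            values[key] = int(raw) if raw.isdigit() else raw
    return values


path = "/data/year=2017/month=7/day=4/hour=14/dump.json"
print(partition_values(path))
# → {'year': 2017, 'month': 7, 'day': 4, 'hour': 14}
```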



In [3]:
# Displays the content of the DataFrame to stdout
testJsonData.show()

+-------+--------------------+---+-------------+----+-----+---+----+
|host_ip|             rawdata|src|    timestamp|year|month|day|hour|
+-------+--------------------+---+-------------+----+-----+---+----+
|my_ipv6|python-random-618...|ESC|1499170050873|2017|    7|  4|  14|
|my_ipv6|python-random-259...|ESC|1499170051727|2017|    7|  4|  14|
|my_ipv6|python-random-998...|ESC|1499170130638|2017|    7|  4|  14|
|my_ipv6|python-random-120...|ESC|1499170131380|2017|    7|  4|  14|
|my_ipv6|python-random-393...|ESC|1499170131993|2017|    7|  4|  14|
|my_ipv6|python-random-775...|ESC|1499170132597|2017|    7|  4|  14|
|my_ipv6|python-random-282...|ESC|1499170133135|2017|    7|  4|  14|
|my_ipv6|python-random-588...|ESC|1499170133706|2017|    7|  4|  14|
+-------+--------------------+---+-------------+----+-----+---+----+



## Untyped Dataset Operations (aka DataFrame Operations)
DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, Python and R.

DataFrames are just Datasets of Rows. These operations are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.

Here we include some basic examples of structured data processing using Datasets:

In [4]:
# Select only the "rawdata" column
testJsonData.select("rawdata").show()

# Select src and month, with month incremented by 1
testJsonData.select(testJsonData['src'], testJsonData['month'] + 1).show()

# Keep only the rows where timestamp > 1499170131380
testJsonData.filter(testJsonData['timestamp'] > 1499170131380).show()

+--------------------+
|             rawdata|
+--------------------+
|python-random-618...|
|python-random-259...|
|python-random-998...|
|python-random-120...|
|python-random-393...|
|python-random-775...|
|python-random-282...|
|python-random-588...|
+--------------------+

+---+-----------+
|src|(month + 1)|
+---+-----------+
|ESC|          8|
|ESC|          8|
|ESC|          8|
|ESC|          8|
|ESC|          8|
|ESC|          8|
|ESC|          8|
|ESC|          8|
+---+-----------+

+-------+--------------------+---+-------------+----+-----+---+----+
|host_ip|             rawdata|src|    timestamp|year|month|day|hour|
+-------+--------------------+---+-------------+----+-----+---+----+
|my_ipv6|python-random-393...|ESC|1499170131993|2017|    7|  4|  14|
|my_ipv6|python-random-775...|ESC|1499170132597|2017|    7|  4|  14|
|my_ipv6|python-random-282...|ESC|1499170133135|2017|    7|  4|  14|
|my_ipv6|python-random-588...|ESC|1499170133706|2017|    7|  4|  14|
+-------+--------------------+---+-------------+----+-----+---+----+
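
The filter uses a strict `>` comparison, so the row whose timestamp equals the threshold is excluded and four of the eight rows remain. The sketch below reproduces that count in plain Python with the timestamps shown earlier, and converts the threshold to a readable time. It assumes the `timestamp` column holds milliseconds since the Unix epoch, which the source does not state explicitly; `epoch_ms_to_utc` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Timestamps copied from the DataFrame output shown earlier.
timestamps = [
    1499170050873, 1499170051727, 1499170130638, 1499170131380,
    1499170131993, 1499170132597, 1499170133135, 1499170133706,
]
threshold = 1499170131380  # the cut-off used in the filter


def epoch_ms_to_utc(ms):
    """Convert integer epoch milliseconds to a timezone-aware UTC datetime."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# Strict comparison: the value equal to the threshold is dropped.
kept = [t for t in timestamps if t > threshold]
print(len(kept))
print(epoch_ms_to_utc(threshold).isoformat())
```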

## Running SQL Queries Programmatically
The `sql` function on `sqlContext` lets an application run SQL queries programmatically and returns the result as a DataFrame.


In [5]:
# Register the DataFrame as a SQL temporary view
sqlContext.registerDataFrameAsTable(testJsonData, "data")

sqlDF = sqlContext.sql("SELECT * FROM data")
sqlDF.show()

+-------+--------------------+---+-------------+----+-----+---+----+
|host_ip|             rawdata|src|    timestamp|year|month|day|hour|
+-------+--------------------+---+-------------+----+-----+---+----+
|my_ipv6|python-random-618...|ESC|1499170050873|2017|    7|  4|  14|
|my_ipv6|python-random-259...|ESC|1499170051727|2017|    7|  4|  14|
|my_ipv6|python-random-998...|ESC|1499170130638|2017|    7|  4|  14|
|my_ipv6|python-random-120...|ESC|1499170131380|2017|    7|  4|  14|
|my_ipv6|python-random-393...|ESC|1499170131993|2017|    7|  4|  14|
|my_ipv6|python-random-775...|ESC|1499170132597|2017|    7|  4|  14|
|my_ipv6|python-random-282...|ESC|1499170133135|2017|    7|  4|  14|
|my_ipv6|python-random-588...|ESC|1499170133706|2017|    7|  4|  14|
+-------+--------------------+---+-------------+----+-----+---+----+



### Further reading
[https://spark.apache.org/docs/1.6.1/sql-programming-guide.html](https://spark.apache.org/docs/1.6.1/sql-programming-guide.html)

[https://docs.databricks.com/spark/latest/spark-sql/index.html](https://docs.databricks.com/spark/latest/spark-sql/index.html)