# Spark SQL

Spark SQL essentially tries to bridge the gap between the two models we mentioned previously—the relational and procedural models.

Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections—at scale!

To support a wide variety of diverse data sources and algorithms in Big Data, Spark SQL introduces a novel extensible optimizer called Catalyst, which makes it easy to add data sources, optimization rules, and data types for advanced analytics such as machine learning. Essentially, Spark SQL leverages the power of Spark to perform distributed, robust, in-memory computations at massive scale on Big Data.

## Read the csv data 

In [8]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('../data/auction.csv',header=True, inferSchema=True)
df

DataFrame[auctionid: bigint, bid: double, bidtime: double, bidder: string, bidderrate: string, openbid: double, price: double, item: string, auction_type: string]

## Show()

In [11]:
df.show(2)

+----------+-----+--------+------------+----------+-------+-----+------------------+-------------+
| auctionid|  bid| bidtime|      bidder|bidderrate|openbid|price|              item| auction_type|
+----------+-----+--------+------------+----------+-------+-----+------------------+-------------+
|1638893549|175.0|2.230949|schadenfreud|         0|   99.0|177.5|Cartier wristwatch|3 day auction|
|1638893549|100.0|2.600116|       chuik|         0|   99.0|177.5|Cartier wristwatch|3 day auction|
+----------+-----+--------+------------+----------+-------+-----+------------------+-------------+
only showing top 2 rows



## Data Schema

In [13]:
df.printSchema()

root
 |-- auctionid: long (nullable = true)
 |-- bid: double (nullable = true)
 |-- bidtime: double (nullable = true)
 |-- bidder: string (nullable = true)
 |-- bidderrate: string (nullable = true)
 |-- openbid: double (nullable = true)
 |-- price: double (nullable = true)
 |-- item: string (nullable = true)
 |-- auction_type: string (nullable = true)



## Get the columns

In [15]:
df.columns

['auctionid',
 'bid',
 'bidtime',
 'bidder',
 'bidderrate',
 'openbid',
 'price',
 'item',
 'auction_type']

## Describe()

In [21]:
df.describe().select("summary","price").show()

+-------+------------------+
|summary|             price|
+-------+------------------+
|  count|             10681|
|   mean|335.04358861528874|
| stddev| 433.5660087308641|
|    min|              26.0|
|    max|            5400.0|
+-------+------------------+



## Summary()

In [24]:
df.summary().select("summary","price").show()

+-------+------------------+
|summary|             price|
+-------+------------------+
|  count|             10681|
|   mean|335.04358861528874|
| stddev| 433.5660087308641|
|    min|              26.0|
|    25%|            186.51|
|    50%|            228.01|
|    75%|             255.0|
|    max|            5400.0|
+-------+------------------+



## Custom Data Schema

In [26]:
from pyspark.sql.types import StructField, IntegerType, StringType, StructType
address_schema = [StructField('city',StringType(),True),StructField('state',StringType(),True)]
final_add_schema = StructType(fields=address_schema)
data_schema = [StructField('id',IntegerType(),True),StructField('name',StringType(),True),final_add_schema]
final_struc = StructType(fields=data_schema)
df = spark1.read.json('../data/data.json',schema=final_struc)
df.show()


+----------+------------+
| auctionid|      bidder|
+----------+------------+
|1638893549|schadenfreud|
|1638893549|       chuik|
+----------+------------+
only showing top 2 rows

