# Spark SQL

### Introduction

### Getting Started

In [1]:
from pyspark.sql import SparkSession

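# Create the session, or reuse one that is already running.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()  # the .config() pair above is only a placeholder option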

In [11]:
claims_df = spark.read.csv("./houston_claims.csv", header=True, inferSchema=True)

In [12]:
claims_df.createOrReplaceTempView("claims")

In [14]:
claims_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- reportedCity: string (nullable = true)
 |-- dateOfLoss: timestamp (nullable = true)
 |-- elevatedBuildingIndicator: boolean (nullable = true)
 |-- floodZone: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- lowestFloodElevation: double (nullable = true)
 |-- amountPaidOnBuildingClaim: double (nullable = true)
 |-- amountPaidOnContentsClaim: double (nullable = true)
 |-- yearofLoss: timestamp (nullable = true)
 |-- reportedZipcode: integer (nullable = true)
 |-- id: string (nullable = true)



In [16]:
spark.sql("SELECT * FROM claims ORDER BY yearofLoss LIMIT 3;").show(vertical = True)

-RECORD 0-----------------------------------------
 _c0                       | 2468                 
 reportedCity              | HOUSTON              
 dateOfLoss                | 1985-09-28 20:00:00  
 elevatedBuildingIndicator | false                
 floodZone                 | C                    
 latitude                  | 29.8                 
 longitude                 | -95.5                
 lowestFloodElevation      | null                 
 amountPaidOnBuildingClaim | 0.0                  
 amountPaidOnContentsClaim | 0.0                  
 yearofLoss                | 1984-12-31 19:00:00  
 reportedZipcode           | 77024                
 id                        | 5e398d7074cbd479f... 
-RECORD 1-----------------------------------------
 _c0                       | 4029                 
 reportedCity              | HOUSTON              
 dateOfLoss                | 1985-09-29 20:00:00  
 elevatedBuildingIndicator | false                
 floodZone                 | C 

> We can also filter rows with a WHERE clause, here keeping only the claims whose latitude equals 29.8.

In [19]:
spark.sql("SELECT * FROM claims WHERE latitude = 29.8 LIMIT 2;").show(vertical = True)

-RECORD 0-----------------------------------------
 _c0                       | 2                    
 reportedCity              | HOUSTON              
 dateOfLoss                | 2004-06-28 20:00:00  
 elevatedBuildingIndicator | false                
 floodZone                 | X                    
 latitude                  | 29.8                 
 longitude                 | -95.6                
 lowestFloodElevation      | null                 
 amountPaidOnBuildingClaim | 1420.89              
 amountPaidOnContentsClaim | 0.0                  
 yearofLoss                | 2003-12-31 19:00:00  
 reportedZipcode           | 77042                
 id                        | 5e398d6774cbd479f... 
-RECORD 1-----------------------------------------
 _c0                       | 3                    
 reportedCity              | HOUSTON              
 dateOfLoss                | 2009-04-27 20:00:00  
 elevatedBuildingIndicator | false                
 floodZone                 | X 

In [20]:
spark.sql("SELECT * FROM claims WHERE latitude = 29.8 LIMIT 2;").explain()

== Physical Plan ==
CollectLimit 2
+- *(1) Filter (isnotnull(latitude#155) AND (latitude#155 = 29.8))
   +- FileScan csv [_c0#150,reportedCity#151,dateOfLoss#152,elevatedBuildingIndicator#153,floodZone#154,latitude#155,longitude#156,lowestFloodElevation#157,amountPaidOnBuildingClaim#158,amountPaidOnContentsClaim#159,yearofLoss#160,reportedZipcode#161,id#162] Batched: false, DataFilters: [isnotnull(latitude#155), (latitude#155 = 29.8)], Format: CSV, Location: InMemoryFileIndex[file:/Users/jeff/Library/Mobile Documents/com~apple~CloudDocs/Documents/jigsaw/..., PartitionFilters: [], PushedFilters: [IsNotNull(latitude), EqualTo(latitude,29.8)], ReadSchema: struct<_c0:int,reportedCity:string,dateOfLoss:timestamp,elevatedBuildingIndicator:boolean,floodZo...




### Filter

### Resources

[Select and Filter Blog](https://hendra-herviawan.github.io/sparksql-select-filter.html)