# Spark SQL Options
PySpark provides two main options for running straight SQL: Spark SQL and the SQL Transformer.

### 1. Spark SQL
Spark DataFrames provide two methods that let users run SQL queries against them by registering a temporary view:

- **createOrReplaceTempView**: The lifetime of this temporary view is tied to the SparkSession that was used to create the dataset. It creates (or replaces, if that view name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL. It is not persisted in memory unless you cache the dataset that underpins the view.
- **createGlobalTempView**: The lifetime of this temporary view is tied to the Spark application. This is useful when you want to share data among different sessions and keep it alive until your application ends.

**SparkSession** vs. **Spark application**
### Spark application
- Can be a single batch job
- Can be an interactive session with multiple jobs
- Can be a long-lived server continually satisfying requests
- A Spark job can consist of more than just a single map and reduce
- Can consist of more than one SparkSession

### SparkSession, on the other hand
- Is a session in the usual sense: an interaction between two or more entities, here the user's code and Spark
- Can be created without explicitly creating a SparkConf, SparkContext, or SQLContext

### 2. SQL Transformer
You can also use the `SQLTransformer` from `pyspark.ml.feature`, which lets you write free-form SQL statements that refer to the input DataFrame through the placeholder `__THIS__`.

## SQL Options within regular PySpark calls
1. The `expr` function in PySpark's SQL functions library (`pyspark.sql.functions`)
2. PySpark's `selectExpr` method


In [2]:
import pyspark
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
spark

In [4]:
path = '/user/harishmohan/Datasets/'
crime = spark.read.csv(path + "rec-crime-pfa.csv", header=True, inferSchema=True)

In [5]:
crime.limit(5).toPandas()

Unnamed: 0,12 months ending,PFA,Region,Offence,Rolling year total number of offences
0,31/03/2003,Avon and Somerset,South West,All other theft offences,25959
1,31/03/2003,Avon and Somerset,South West,Bicycle theft,3090
2,31/03/2003,Avon and Somerset,South West,Criminal damage and arson,26202
3,31/03/2003,Avon and Somerset,South West,Death or serious injury caused by illegal driving,2
4,31/03/2003,Avon and Somerset,South West,Domestic burglary,14561


In [6]:
crime.printSchema()

root
 |-- 12 months ending: string (nullable = true)
 |-- PFA: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Offence: string (nullable = true)
 |-- Rolling year total number of offences: integer (nullable = true)



In [7]:
df = crime.withColumnRenamed('Rolling year total number of offences', 'Count')

In [9]:
df.printSchema()

root
 |-- 12 months ending: string (nullable = true)
 |-- PFA: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Offence: string (nullable = true)
 |-- Count: integer (nullable = true)



In [10]:
df.createOrReplaceTempView("tempView")

In [18]:
sql_results = spark.sql("SELECT * \
                        FROM tempView \
                        WHERE Count > 1000")

sql_results.limit(5).toPandas()

Unnamed: 0,12 months ending,PFA,Region,Offence,Count
0,31/03/2003,Avon and Somerset,South West,All other theft offences,25959
1,31/03/2003,Avon and Somerset,South West,Bicycle theft,3090
2,31/03/2003,Avon and Somerset,South West,Criminal damage and arson,26202
3,31/03/2003,Avon and Somerset,South West,Domestic burglary,14561
4,31/03/2003,Avon and Somerset,South West,Drug offences,2308


In [23]:
spark.sql("SELECT Region, sum(Count) AS Total, avg(Count) AS Avg, count(Count) AS Observations  \
          FROM tempview GROUP BY Region").toPandas()

Unnamed: 0,Region,Total,Avg,Observations
0,Fraud: CIFAS,7678981,247709.064516,31
1,North West,30235732,5737.33055,5270
2,British Transport Police,3029117,2873.925047,1054
3,Wales,11137260,2641.665085,4216
4,London,42691902,20252.325427,2108
5,South East,30911995,5865.6537,5270
6,Fraud: Action Fraud,5921984,191031.741935,31
7,Fraud: UK Finance,2925861,94382.612903,31
8,South West,17985880,3412.880455,5270
9,East,19890612,3145.258065,6324


In [24]:
from pyspark.ml.feature import SQLTransformer 

In [31]:
sqlTrans = SQLTransformer(statement = "SELECT PFA, Region, Offence \
                                        FROM __THIS__")

In [32]:
sqlTrans.transform(df).show(5)

+-----------------+----------+--------------------+
|              PFA|    Region|             Offence|
+-----------------+----------+--------------------+
|Avon and Somerset|South West|All other theft o...|
|Avon and Somerset|South West|       Bicycle theft|
|Avon and Somerset|South West|Criminal damage a...|
|Avon and Somerset|South West|Death or serious ...|
|Avon and Somerset|South West|   Domestic burglary|
+-----------------+----------+--------------------+
only showing top 5 rows



In [33]:
type(sqlTrans)

pyspark.ml.feature.SQLTransformer

In [38]:
sqlTrans = SQLTransformer(statement = "SELECT Offence, SUM(Count) as Total FROM __THIS__ \
                                      GROUP BY Offence")

In [42]:
Total_Offence = SQLTransformer(statement = "SELECT SUM(Count) AS Total_Offence FROM __THIS__")
Total_Offence.transform(df).show()

+-------------+
|Total_Offence|
+-------------+
|    244720928|
+-------------+



In [39]:
sqlTrans.transform(df).show(5)

+--------------------+--------+
|             Offence|   Total|
+--------------------+--------+
|Public order offe...|10925676|
|       Bicycle theft| 5297006|
|Residential burglary| 1671469|
|Violence without ...|16590158|
|All other theft o...|30979393|
+--------------------+--------+
only showing top 5 rows



In [40]:
from pyspark.sql.functions import expr

In [45]:
# 244720928 is the Total_Offence value computed with the SQLTransformer above
df.withColumn("percentage", expr("round((Count / 244720928) * 100, 2)")).show()

+----------------+-----------------+----------+--------------------+-----+----------+
|12 months ending|              PFA|    Region|             Offence|Count|percentage|
+----------------+-----------------+----------+--------------------+-----+----------+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|25959|      0.01|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft| 3090|       0.0|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|26202|      0.01|
|      31/03/2003|Avon and Somerset|South West|Death or serious ...|    2|       0.0|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|14561|      0.01|
|      31/03/2003|Avon and Somerset|South West|       Drug offences| 2308|       0.0|
|      31/03/2003|Avon and Somerset|South West|      Fraud offences| 5339|       0.0|
|      31/03/2003|Avon and Somerset|South West|            Homicide|   19|       0.0|
|      31/03/2003|Avon and Somerset|South West|Miscell

In [47]:
df.selectExpr("*", 'round((count/244720928)*100, 2) AS Percent').filter("Region = 'South West'").toPandas()

Unnamed: 0,12 months ending,PFA,Region,Offence,Count,Percent
0,31/03/2003,Avon and Somerset,South West,All other theft offences,25959,0.01
1,31/03/2003,Avon and Somerset,South West,Bicycle theft,3090,0.00
2,31/03/2003,Avon and Somerset,South West,Criminal damage and arson,26202,0.01
3,31/03/2003,Avon and Somerset,South West,Death or serious injury caused by illegal driving,2,0.00
4,31/03/2003,Avon and Somerset,South West,Domestic burglary,14561,0.01
...,...,...,...,...,...,...
5265,31/12/2018,Wiltshire,South West,Stalking and harassment,2380,0.00
5266,31/12/2018,Wiltshire,South West,Theft from the person,347,0.00
5267,31/12/2018,Wiltshire,South West,Vehicle offences,2895,0.00
5268,31/12/2018,Wiltshire,South West,Violence with injury,5701,0.00


### Another Dataset

This dataset contains a list of Google Play Store apps, with information about each app such as its category, rating, reviews, and size.

**Source:** https://www.kaggle.com/lava18/google-play-store-apps

In [48]:
google = spark.read.csv(path + "googleplaystore.csv", header=True, inferSchema=True)

In [49]:
google.limit(5).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [51]:
google.printSchema()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Reviews: string (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)


In [52]:
from pyspark.sql.types import FloatType, IntegerType

In [55]:
# Cast the numeric columns that were read in as strings.
# Note: IntegerType cannot represent fractional prices.
df = google.withColumn("Rating", google["Rating"].cast(FloatType())) \
           .withColumn("Reviews", google["Reviews"].cast(IntegerType())) \
           .withColumn("Price", google["Price"].cast(IntegerType()))
df.printSchema()
df.limit(5).toPandas()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: float (nullable = true)
 |-- Reviews: integer (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: integer (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [56]:
df.createOrReplaceTempView("tempView")

In [58]:
spark.sql("SELECT * \
          FROM tempview \
          WHERE Rating > 4.1").limit(5).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
1,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
2,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
3,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
4,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up


In [63]:
sql_results = spark.sql("SELECT App, Rating \
                         FROM tempView \
                         WHERE Category = 'COMICS' AND Rating > 4.5")
sql_results.limit(5).toPandas() 

Unnamed: 0,App,Rating
0,Manga Master - Best manga & comic reader,4.6
1,GANMA! - All original stories free of charge f...,4.7
2,Röhrich Werner Soundboard,4.7
3,Unicorn Pokez - Color By Number,4.8
4,Manga - read Thai translation,4.6


In [64]:
spark.sql("SELECT Category, sum(Reviews) AS Total_Reviews \
           FROM tempview \
           GROUP BY Category \
           ORDER BY Total_Reviews DESC").limit(1).toPandas()

Unnamed: 0,Category,Total_Reviews
0,GAME,1585422349


In [65]:
spark.sql("SELECT App, Reviews \
           FROM tempview \
           ORDER BY Reviews DESC").show(1) 

+--------+--------+
|     App| Reviews|
+--------+--------+
|Facebook|78158306|
+--------+--------+
only showing top 1 row



In [66]:
spark.sql("SELECT * \
           FROM tempview WHERE App like '%dating%'").limit(5).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,"Meet, chat & date. Free dating app - Chocolate...",DATING,3.9,8661,9.5M,"1,000,000+",Free,0,Mature 17+,Dating,"April 3, 2018",0.1.11,4.0 and up
1,Friend Find: free chat + flirt dating app,DATING,,23,11M,100+,Free,0,Mature 17+,Dating,"July 31, 2018",1.0,4.4 and up
2,Spine- The dating app,DATING,5.0,5,9.3M,500+,Free,0,Teen,Dating,"July 14, 2018",4.0,4.0.3 and up
3,Princess Closet : Otome games free dating sim,FAMILY,4.5,29495,56M,"1,000,000+",Free,0,Teen,Simulation,"May 24, 2018",1.11.0,4.0.3 and up
4,happn – Local dating app,LIFESTYLE,4.3,1118201,Varies with device,"10,000,000+",Free,0,Mature 17+,Lifestyle,"July 24, 2018",Varies with device,Varies with device


In [68]:
from pyspark.ml.feature import SQLTransformer

In [71]:
sqlTrans = SQLTransformer(statement = "SELECT count(*) AS Total FROM __THIS__ WHERE Type = 'Free'")
sqlTrans.transform(df).show()

+-----+
|Total|
+-----+
|10037|
+-----+



In [73]:
sqlTrans = SQLTransformer(statement = "SELECT Genres, count(*) as Total FROM __THIS__ GROUP BY Genres ORDER BY Total DESC")
sqlTrans.transform(df).show(1)

+------+-----+
|Genres|Total|
+------+-----+
| Tools|  842|
+------+-----+
only showing top 1 row



In [75]:
sqlTrans = SQLTransformer(statement = "SELECT App, Reviews FROM __THIS__ WHERE Genres = 'Tools' AND Reviews > 100")
sqlTrans.transform(df).show(10)

+--------------------+--------+
|                 App| Reviews|
+--------------------+--------+
|   Moto File Manager|   38655|
|              Google| 8033493|
|    Google Translate| 5745093|
|        Moto Display|   18239|
|      Motorola Alert|   24199|
|     Motorola Assist|   37333|
|Cache Cleaner-DU ...|12759663|
|  Moto Suggestions ™|     308|
|          Moto Voice|   33216|
|          Calculator|   40770|
+--------------------+--------+
only showing top 10 rows

