In [None]:
spark

# Spark SQL as a temporary view

In [None]:
from pyspark.sql import functions as F
from pyspark.sql import types as T

In [None]:
simpleData = [
    ("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000),
]

columns = ["employee_name", "department", "state", "salary", "age", "bonus"]

df = spark.createDataFrame(data=simpleData, schema=columns)

## `spark.sql()` to execute a Spark SQL statement

Spark provides `sql()` on `SparkSession` so that we can run SQL statements against a dataframe. However, unless the dataframe is registered as a temporary view, Spark does not know which table name refers to it (table names are resolved through the `catalog`). The query below therefore fails.

In [None]:
# Expected to fail: `df` is a Python variable, not a table or view known to the catalog
spark.sql("select * from df")

## `createOrReplaceTempView()` to register a dataframe as a temporary view

In [None]:
df.createOrReplaceTempView("EMPLOYEES")

After registering the `df` dataframe as a temporary view called `EMPLOYEES`, we can use SQL commands to work with its data, just as we would with the DataFrame API.

In [None]:
spark.sql("select * from EMPLOYEES")

As the output shows, the result of `spark.sql()` is a dataframe, which can be displayed by calling `show()`.

In [None]:
spark.sql("select * from EMPLOYEES").show(truncate=False)

In this learning environment, the `sparkmagic` kernel provides the `%%sql` cell magic, which lets us execute a SQL command directly instead of passing it as an argument to `spark.sql()`. The kernel also displays the result automatically.

In [None]:
%%sql

select * from EMPLOYEES

In [None]:
# collect() returns a list of Row objects; [0][0] extracts the single count value
spark.sql("select count(*) from EMPLOYEES").collect()[0][0]

## Default Spark Catalog (SparkSessionCatalog)

When a SparkSession starts, it creates an empty Spark catalog. A catalog normally contains metadata about databases, tables, views, and temporary views (which exist only within a session). As a result, the `EMPLOYEES` temporary view that we registered for our `df` dataframe disappears when the session ends.

In [None]:
# Restart the kernel and execute the first cell to create a new SparkSession.
# Then run the following command; it is expected to fail because the temporary view no longer exists.
spark.sql("select * from EMPLOYEES").show(truncate=False)

## How to manage a Spark catalog

Spark provides a catalog object for working with metadata about databases, tables, views, and temporary views. We can access it via the `spark.catalog` property.

In [None]:
spark.catalog

### List all databases

In [None]:
spark.catalog.listDatabases()

### List all tables

In [None]:
spark.catalog.listTables("default")  # or spark.catalog.listTables(), which uses the current database ("default" unless changed)

### Drop a temporary view

In [None]:
spark.catalog.dropTempView("EMPLOYEES")

In [None]:
# Expected to fail: the EMPLOYEES temporary view has just been dropped
spark.sql("select * from EMPLOYEES").show(truncate=False)

For more details, see the [PySpark Catalog API](https://spark.apache.org/docs/3.2.4/api/python/reference/api/pyspark.sql.Catalog.html?highlight=catalog#pyspark.sql.Catalog).