# Spark Tables

This notebook shows how to use the Spark Catalog API to query databases, tables, and columns.

A full list of documented methods is available in the [PySpark `Catalog` API reference](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Catalog).

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("Spark Tables").enableHiveSupport().getOrCreate()

In [2]:
us_flights_file = "/home/karthik/SparkCourse/pyspark notebooks/data/departuredelays.csv"

In [3]:
# Create a database and a table.
# CASCADE also drops any tables left over from a previous run; without it,
# DROP DATABASE fails on a non-empty database.
spark.sql("""DROP DATABASE IF EXISTS sparklearn_db CASCADE""")
spark.sql("""CREATE DATABASE IF NOT EXISTS sparklearn_db""")
spark.sql("""USE sparklearn_db""")
spark.sql("""CREATE TABLE us_delays_flights_tbl(date STRING, delay INT, distance INT, origin STRING, destination STRING)""")

DataFrame[]

### Display the tables

In [4]:
spark.sql("""SHOW TABLES""").show(10,False)

+-------------+---------------------+-----------+
|database     |tableName            |isTemporary|
+-------------+---------------------+-----------+
|sparklearn_db|us_delays_flights_tbl|false      |
+-------------+---------------------+-----------+
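
The Catalog API can answer similar questions without SQL. The following is a minimal sketch, assuming `spark` is an active `SparkSession` like the one created in the first cell; the helper name `describe_catalog` is hypothetical and not part of PySpark.

```python
def describe_catalog(spark):
    """Return the current database and the sorted names of all databases.

    `spark` is assumed to be an active SparkSession. This helper is
    illustrative only; it is not part of the PySpark API.
    """
    catalog = spark.catalog
    return {
        "current": catalog.currentDatabase(),
        "databases": sorted(db.name for db in catalog.listDatabases()),
    }
```

Calling `describe_catalog(spark)` after `USE sparklearn_db` should report `sparklearn_db` as the current database.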



## Read our US Flights table

In [7]:
df = spark.read.csv(us_flights_file, header=True,schema="date STRING, delay INT, distance INT, origin STRING, destination STRING")

In [8]:
df.show(5,False)

+--------+-----+--------+------+-----------+
|date    |delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01011245|6    |602     |ABE   |ATL        |
|01020600|-8   |369     |ABE   |DTW        |
|01021245|-2   |602     |ABE   |ATL        |
|01020605|-4   |602     |ABE   |ATL        |
|01031245|-4   |602     |ABE   |ATL        |
+--------+-----+--------+------+-----------+
only showing top 5 rows



## Save into our table

In [11]:
df.write.mode("overwrite").saveAsTable("us_delays_flights_tbl")
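
One way to sanity-check the write is to read the table back and compare row counts. This is only a sketch: the helper name `matches_source` is hypothetical, and it assumes `spark` is the active session and `df` is the DataFrame that was just saved.

```python
def matches_source(spark, table_name, source_df):
    """Return True if the saved table has as many rows as the source DataFrame.

    A matching row count is a quick sanity check, not a full comparison
    of the table contents.
    """
    saved_df = spark.table(table_name)  # read the table back via the catalog
    return saved_df.count() == source_df.count()
```

For example, `matches_source(spark, "us_delays_flights_tbl", df)`.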

## Cache the Table

In [12]:
spark.sql("""CACHE TABLE us_delays_flights_tbl""")

DataFrame[]

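`CACHE TABLE` is eager unless the `LAZY` keyword is given, so the table is cached as soon as the statement returns. Check whether the table is cached: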

In [13]:
spark.catalog.isCached("us_delays_flights_tbl")

True
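
When the cached data is no longer needed, the Catalog API can release it with `uncacheTable`. A minimal sketch, assuming `spark` is the active session; the helper name `release_cache` is hypothetical.

```python
def release_cache(spark, table_name):
    """Uncache `table_name` if it is currently cached.

    Returns the cached state after the call, which is expected to be False.
    """
    if spark.catalog.isCached(table_name):
        spark.catalog.uncacheTable(table_name)
    return spark.catalog.isCached(table_name)
```

`spark.catalog.clearCache()` is the broader alternative: it removes every cached table in the session rather than a single one.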

### Display tables within a Database

Note that the table is MANAGED by Spark (tableType='MANAGED'), meaning Spark manages both its metadata and its data.

In [14]:
spark.catalog.listTables("sparklearn_db")

[Table(name='us_delays_flights_tbl', database='sparklearn_db', description=None, tableType='MANAGED', isTemporary=False)]

### Display Columns for a table

In [15]:
spark.catalog.listColumns("us_delays_flights_tbl")

[Column(name='date', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='delay', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='distance', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='origin', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='destination', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False)]
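
The column metadata can be turned back into a DDL-style schema string like the one passed to `spark.read.csv` earlier. This is a sketch under assumptions: the helper name `columns_to_ddl` is hypothetical, and it is only meant for simple column types such as `string` and `int`.

```python
def columns_to_ddl(spark, table_name):
    """Build a 'name TYPE, name TYPE, ...' schema string from catalog metadata.

    Assumes simple column types; nested types such as structs are not
    handled specially here.
    """
    columns = spark.catalog.listColumns(table_name)
    return ", ".join(f"{col.name} {col.dataType.upper()}" for col in columns)
```

For the column listing above, this yields `date STRING, delay INT, distance INT, origin STRING, destination STRING`.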

### Create Unmanaged Tables

In [16]:
spark.sql("USE sparklearn_db")
spark.sql("""CREATE TABLE us_delays_flights_tbl_um(date STRING, delay INT, distance INT, origin STRING, destination STRING)
USING csv
OPTIONS(path "/home/karthik/SparkCourse/pyspark notebooks/data/departuredelays.csv", header="True")""")

DataFrame[]

### Display Tables

**Note**: The new table has tableType='EXTERNAL', which indicates that it is unmanaged: Spark tracks only its metadata, while the data stays in the CSV file at the given path. The first table, by contrast, has tableType='MANAGED'.

In [17]:
spark.catalog.listTables("sparklearn_db")

[Table(name='us_delays_flights_tbl', database='sparklearn_db', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='us_delays_flights_tbl_um', database='sparklearn_db', description=None, tableType='EXTERNAL', isTemporary=False)]
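
With more than a couple of tables, it can help to group the listing by `tableType` to see at a glance which tables are managed and which are external. A minimal sketch, assuming `spark` is the active session; the helper name `tables_by_type` is hypothetical.

```python
def tables_by_type(spark, database):
    """Group the table names in `database` by their tableType."""
    grouped = {}
    for table in spark.catalog.listTables(database):
        grouped.setdefault(table.tableType, []).append(table.name)
    return grouped
```

For the listing above, `tables_by_type(spark, "sparklearn_db")` would put `us_delays_flights_tbl` under `'MANAGED'` and `us_delays_flights_tbl_um` under `'EXTERNAL'`.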

### Display Columns for a table

In [18]:
spark.catalog.listColumns("us_delays_flights_tbl_um")

[Column(name='date', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='delay', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='distance', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='origin', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='destination', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False)]