# Day 11 - Managing Metadata: Catalog, Tables and Views

## Python API
### Class: pyspark.sql.session.[SparkSession](https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html#pyspark.sql.SparkSession)
The SparkSession object is usually assigned to a variable named *spark*.

#### Object Properties:
* **catalog** - to access the `Catalog` interface for maintaining metadata about databases, tables, functions, etc.

#### Object Methods: (reading from data source Table -> DataFrame)
* **table()** - Returns the specified table as a DataFrame

This method is the table-based counterpart of the `DataFrameReader` (`spark.read`), which I use to read from file-based data sources.

### Class: pyspark.sql.dataframe.[DataFrame](https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html#pyspark.sql.DataFrame)

#### Object Properties:
* **write** - to access the `DataFrameWriter` interface for writing from a DataFrame to a data sink
* **writeStream** - to access the `DataStreamWriter` object for writing streaming data to external storage.

#### Object Methods:
* **createGlobalTempView()**
* **createOrReplaceGlobalTempView()** 
* **createOrReplaceTempView()** 
* **createTempView()** 
* **registerTempTable()** - deprecated since Spark 2.0, use `createOrReplaceTempView()` instead

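The function names indicate the following characteristics:
* *Temp*, i.e. temporary, means the lifetime of the table/view is tied to the SparkSession that was used to create it (for *Global* views: to the Spark application).
* *Global* means the table/view is shared across all SparkSessions of the same Spark application and is addressed through the system database `global_temp`; all other views are local to the SparkSession that created them.
* *orReplace* means an existing table/view with the same name is replaced; without it, an exception is thrown if the name already exists in the Catalog.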

### Class: pyspark.sql.readwriter.[DataFrameWriter](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter)

Interface used to write a `DataFrame` to external storage systems (e.g. file systems, key-value stores, etc). Accessed through the `DataFrame.write` property.
#### Object Methods: (writing DataFrame -> Table as data sink)
* **bucketBy()** - Buckets the output by the given columns. If specified, the output is laid out on the file system similar to Hive’s bucketing scheme.
* **insertInto()** - Inserts the content of the DataFrame into the specified table. It requires that the schema of the `DataFrame` is the same as the schema of the table.
* **saveAsTable()** - Saves the content of the DataFrame as the specified table.

### Class: pyspark.sql.catalog.[Catalog](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Catalog) `[abstract]`
Interface to the `Catalog`, accessed through the `SparkSession.catalog` property.

To me, it looks like this *pyspark* interface class provides only a subset of the Spark Catalog functionality. To have all metadata functions available, I must use the [SQL API](https://docs.databricks.com/spark/latest/spark-sql/language-manual/index.html). Therefore I list only the most important functions, as a reminder of how pyspark interacts with the Spark Catalog.
#### Class Functions:
* **createExternalTable()** - Creates an **unmanaged** table based on the dataset in a data source. It returns the DataFrame associated with the external table. Deprecated since Spark 2.2 in favor of `createTable()`.
* **createTable()** - Creates a table based on the dataset in a data source. It returns the DataFrame associated with the table. When a path is specified, an **unmanaged** external table is created from the data at the given path. Otherwise a **managed** table is created. Optionally, a schema can be provided for the created table.
* **dropGlobalTempView()** - Drops the global temporary view with the given view name in the catalog. 
* **dropTempView()** - Drops the local temporary view with the given view name in the catalog. 

## SQL API
Ref. [SQL Language Reference](https://docs.databricks.com/spark/latest/spark-sql/language-manual/index.html) provided by Databricks.

Only the SQL API provides the full scope of metadata management functionality in the Spark Catalog.