## Define Schema for Tables

When we create a table using `spark.catalog.createTable` or `spark.catalog.createExternalTable`, we need to specify the schema.

* The schema can be inferred, or we can pass it as a `StructType` object while creating the table.
* `StructType` takes a list of `StructField` objects.
* `StructField` is built using a column name and a data type. All the data types are available under `pyspark.sql.types`.
* We need to pass the table name and schema to `spark.catalog.createTable`.
* We have to pass the path along with the name and schema to `spark.catalog.createExternalTable`.
* We can use `source` to define the file format along with applicable options. For example, to create a table over CSV files, `source` will be `csv` and we can pass CSV options such as `sep` and `header`.
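
The last bullet can be sketched as follows. This is only a minimal sketch under stated assumptions: `csv_options` and `create_csv_table` are hypothetical helper names (not part of Spark), and `spark` is assumed to be an existing `SparkSession`. Option values are kept as strings, as in the rest of this notebook.

```python
def csv_options(sep=",", header=True):
    """Build CSV options as strings (hypothetical helper, not a Spark API)."""
    return {
        "sep": sep,
        "header": str(header).lower(),  # "true" or "false"
    }


def create_csv_table(spark, table_name, path, **options):
    """Create an external table over CSV files (assumes a live SparkSession)."""
    # source selects the file format; the remaining keyword arguments
    # are forwarded as options for that format.
    return spark.catalog.createExternalTable(
        table_name, path=path, source="csv", **options
    )


# Options for a tab-delimited file with a header row
tab_separated = csv_options(sep="\t", header=True)
print(tab_separated)
```

Keeping the options in a dictionary makes it possible to reuse the same settings for several tables by unpacking them with `**tab_separated`.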

Let us start the Spark session for this notebook so that we can execute the code provided.

If you want to practice from the terminal instead, here is the command to use.

```
pyspark2 \
  --master yarn \
  --name "Spark Metastore" \
  --conf spark.ui.port=0
```

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession. \
    builder. \
    config("spark.ui.port", "0"). \
    appName("Spark Metastore"). \
    master("yarn"). \
    getOrCreate()

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "2")

### Tasks

Let us perform a few tasks to create an empty table using `spark.catalog.createTable` or `spark.catalog.createExternalTable`.

* Create database **{username}_hr_db** and table **employees** with the following fields. Let us create the database first and then see how to create the table.
  * employee_id of type Integer
  * first_name of type String
  * last_name of type String
  * salary of type Float
  * nationality of type String

In [None]:
import getpass
username = getpass.getuser()

In [None]:
spark.sql(f"CREATE DATABASE IF NOT EXISTS {username}_hr_db")

In [None]:
spark.catalog.setCurrentDatabase(f"{username}_hr_db")

* Build a `StructType` object using a list of `StructField` objects.

In [None]:
from pyspark.sql.types import StructField, StructType, \
    IntegerType, StringType, FloatType

In [None]:
employeesSchema = StructType([
    StructField("employee_id", IntegerType()),
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("salary", FloatType()),
    StructField("nationality", StringType())
])

In [None]:
spark.read.schema?

* Create the table by passing the `StructType` object as the schema.

In [None]:
spark.catalog.createTable("employees", schema=employeesSchema)

* List the tables in the database we created.

In [None]:
spark.catalog.listTables()
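
`spark.catalog.listTables()` returns a list of table descriptions rather than plain names. A minimal sketch for checking that a table was registered is shown below. It assumes `spark` is the session created earlier, and `table_names` is a hypothetical helper, not part of the Spark API.

```python
def table_names(spark, db_name=None):
    """Return the sorted table names in a database (hypothetical helper)."""
    # With no database name, listTables() uses the current database.
    if db_name is None:
        tables = spark.catalog.listTables()
    else:
        tables = spark.catalog.listTables(db_name)
    # Each entry exposes the table name through its `name` attribute.
    return sorted(table.name for table in tables)
```

After the table is created, `"employees" in table_names(spark)` should evaluate to `True`.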

* Create a database named **{username}_airlines** and create an external table for **airport-codes.txt**.
  * The data has a header.
  * Fields in each record are delimited by a tab character.
  * We can pass options such as `sep`, `header`, and `inferSchema` to describe the file layout and let Spark infer the schema.


In [None]:
spark.catalog.createExternalTable?

In [None]:
import getpass
username = getpass.getuser()

In [None]:
spark.sql(f"CREATE DATABASE IF NOT EXISTS {username}_airlines")

In [None]:
spark.catalog.setCurrentDatabase(f"{username}_airlines")

* To create an external table, we need write permissions on the path we want to use.
* As we have only read permissions on **/public/airlines_all/airport-codes**, we cannot use that path while creating the external table.
* Let us copy the data to **/user/`whoami`/airlines_all/airport-codes**.

In [None]:
%%sh

hdfs dfs -mkdir -p /user/`whoami`/airlines_all
hdfs dfs -cp /public/airlines_all/airport-codes /user/`whoami`/airlines_all

In [None]:
import getpass
username = getpass.getuser()

airport_codes_path = f"/user/{username}/airlines_all/airport-codes"

In [None]:
spark.catalog. \
    createExternalTable("airport_codes",
                        path=airport_codes_path,
                        source="csv",
                        sep="\t",
                        header="true",
                        inferSchema="true"
                       )

In [None]:
spark.catalog.listTables()

In [None]:
spark.read.table("airport_codes").show()
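
With `inferSchema` enabled, Spark decides the column types from the data, so it is worth reviewing them after the table is created. The following is a minimal sketch, assuming `spark` is the active session and the table exists in the current database; `column_types` is a hypothetical helper, not part of the Spark API.

```python
def column_types(spark, table_name):
    """Map column name to data type for a catalog table (hypothetical helper)."""
    # listColumns returns one entry per column, each exposing
    # `name` and `dataType` attributes.
    return {
        column.name: column.dataType
        for column in spark.catalog.listColumns(table_name)
    }
```

Calling `column_types(spark, "airport_codes")` returns a dictionary that can be compared against the types we expect for each field.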