# Writing to Tables

Here is the schema for the bookstore dataset used in this notebook:

![bookstore dataset schema](../Includes/images/image1.png)

In [0]:
%run ../Includes/Copy-Datasets

In [0]:
-- Making sure the table is not created before carrying on with the notebook
DROP TABLE IF EXISTS orders

In [0]:
%python
dbutils.fs.rm("/user/hive/warehouse/orders", recurse=True)

Creating an `orders` Delta Table from parquet files:

In [0]:
CREATE TABLE orders AS
SELECT * FROM parquet.`${dataset.bookstore}/orders`

In [0]:
SELECT * FROM orders

As shown, Parquet files have a well-defined schema, so the column names and types are extracted correctly without any extra configuration.

When writing to tables, it can be useful to overwrite the data in a table. There are several benefits to overwriting a table instead of deleting and recreating it. For example, the old version of the table still exists, so all of its data can easily be retrieved using **Time Travel**.

Additionally, overwriting a table is much faster because it does NOT need to list the directory recursively or delete any files. It is also an atomic operation: concurrent queries can still read the table while it is being overwritten.

**Due to ACID transaction guarantees, if overwriting the table fails, the table keeps its previous state**.

## Overwriting Method 1: CREATE OR REPLACE TABLE

This statement fully replaces the contents of a table each time it is executed.

In [0]:
CREATE OR REPLACE TABLE orders AS
SELECT * FROM parquet.`${dataset.bookstore}/orders`

In [0]:
DESCRIBE HISTORY orders

The history shows:
* Version 0: CREATE TABLE AS SELECT statement
* Version 1: CREATE OR REPLACE TABLE AS SELECT
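Because version 0 is still recorded in the history listed above, the data from before the overwrite remains queryable. The following is a minimal sketch of Time Travel on the `orders` table, assuming the version numbers match that history; the `RESTORE` statement is left commented out so that it does not change the table state used in the rest of the notebook.

```sql
-- Read the table as it was before the overwrite (version 0 in the history above)
SELECT count(*) FROM orders VERSION AS OF 0;

-- Equivalent shorthand syntax for the same version
SELECT count(*) FROM orders@v0;

-- Optionally roll the table back to that version.
-- This would add a new RESTORE entry to the table history.
-- RESTORE TABLE orders TO VERSION AS OF 0;
```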

## Overwriting Method 2: INSERT OVERWRITE

This method produces a nearly identical result to the one above: data in the target table is replaced by the data returned from the query.

Some differences with the previous method:
* It can only overwrite an existing table
* It can only overwrite the table with new records that match the current table schema

Therefore, it is a safer technique for overwriting an existing table without the risk of modifying the table schema.

In [0]:
INSERT OVERWRITE orders
SELECT * FROM parquet.`${dataset.bookstore}/orders`

In [0]:
DESCRIBE HISTORY orders

The `INSERT OVERWRITE` operation has been recorded in the table history as a `WRITE` operation with the mode `overwrite`.

If the data being written has a different schema (e.g., it adds a new timestamp column), an exception is raised due to a **schema mismatch**:

In [0]:
INSERT OVERWRITE orders
SELECT *, current_timestamp() FROM parquet.`${dataset.bookstore}/orders`

So the way Databricks enforces the schema on write is the primary difference between `INSERT OVERWRITE` and `CREATE OR REPLACE TABLE`.
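To see the other side of this difference, the sketch below runs the same kind of query with an extra column through `CREATE OR REPLACE TABLE`. It uses a scratch table named `orders_scratch` and a column alias `ingested_at`; both names are hypothetical and chosen only for illustration, so that the schema of `orders` itself stays untouched.

```sql
-- `orders_scratch` is a hypothetical table used only for this illustration
CREATE OR REPLACE TABLE orders_scratch AS
SELECT * FROM parquet.`${dataset.bookstore}/orders`;

-- Unlike INSERT OVERWRITE, this succeeds and redefines the schema with the extra column
-- (`ingested_at` is an assumed alias; it is not part of the dataset)
CREATE OR REPLACE TABLE orders_scratch AS
SELECT *, current_timestamp() AS ingested_at
FROM parquet.`${dataset.bookstore}/orders`;

-- The new column now appears in the table schema
DESCRIBE TABLE orders_scratch;

-- Clean up the scratch table
DROP TABLE IF EXISTS orders_scratch;
```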

## Appending Records to Tables

### Method 1: INSERT INTO

In [0]:
-- Insert new data using a query that reads the parquet files in the 'orders-new' directory
INSERT INTO orders
SELECT * FROM parquet.`${dataset.bookstore}/orders-new`

700 new records have been successfully added to the `orders` table.

In [0]:
SELECT count(*) FROM orders

`INSERT INTO` is a simple and efficient operation for inserting new data. However, it does not have any built-in guarantees to prevent inserting the same records multiple times.

Executing the `INSERT INTO` query above again writes the same records to the target table a second time, resulting in duplicate records.
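One way to check whether such duplicates have crept in is to group by the key that should be unique. This sketch assumes the `orders` table has an `order_id` column that identifies each order; the column name is an assumption and is not confirmed by the text above.

```sql
-- Assumes `order_id` uniquely identifies an order (the column name is an assumption)
SELECT order_id, count(*) AS copies
FROM orders
GROUP BY order_id
HAVING count(*) > 1
ORDER BY copies DESC
```

An empty result would suggest that no order has been inserted more than once.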

### Method 2: MERGE INTO

`MERGE INTO` upserts data from a source table, view, or DataFrame into a target Delta table. `INSERT`, `UPDATE`, and `DELETE` actions can all be used within this statement.

In the following query:
* A temporary view is created from the new customer data
* The `MERGE INTO` statement applies the changes from that temporary view to the `customers` table, matching rows on the `customer_id` key
  * `WHEN MATCHED`: an update is carried out only if the current row has a NULL email and the new record does not. In that case, the email and the `updated` timestamp are updated
  * `WHEN NOT MATCHED`: an insert is carried out

In [0]:
CREATE OR REPLACE TEMP VIEW customers_updates AS
SELECT * FROM json.`${dataset.bookstore}/customers-json-new`;

MERGE INTO customers c
USING customers_updates u
ON c.customer_id = u.customer_id
WHEN MATCHED AND c.email IS NULL AND u.email IS NOT NULL THEN
  UPDATE SET email = u.email, updated = u.updated
WHEN NOT MATCHED THEN INSERT *

As a result:
* 100 records have been updated
* 201 records have been inserted
* 0 records have been deleted

Therefore, in a `MERGE` operation, updates, inserts, and deletes are completed in a single atomic transaction. Additionally, it is a **great solution for avoiding duplicates when inserting records**.
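The customers query uses only update and insert clauses. As a rough sketch of how a delete clause can sit in the same statement, the following assumes a hypothetical target table `target_table` with columns `id` and `value`, and a hypothetical source `source_updates` with the same columns plus a boolean `is_deleted` flag. None of these names exist in the bookstore dataset.

```sql
-- All table and column names here are hypothetical, for illustration only
MERGE INTO target_table t
USING source_updates s
ON t.id = s.id
-- Matching rows flagged as deleted in the source are removed from the target
WHEN MATCHED AND s.is_deleted THEN
  DELETE
-- All other matching rows are updated in place
WHEN MATCHED THEN
  UPDATE SET t.value = s.value
-- Unmatched rows are inserted unless they are already flagged as deleted
WHEN NOT MATCHED AND NOT s.is_deleted THEN
  INSERT (id, value) VALUES (s.id, s.value)
```

The `WHEN MATCHED` clauses are evaluated in the order written, so the conditional delete has to come before the unconditional update.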

Another example:

In [0]:
CREATE OR REPLACE TEMP VIEW books_updates
  (book_id STRING, title STRING, author STRING, category STRING, price DOUBLE)
USING CSV
OPTIONS (
  header = "true",
  path = "${dataset.bookstore}/books-csv-new",
  delimiter = ";"
);

SELECT * FROM books_updates

This temporary view contains 5 books, but only the *Computer Science* books are to be inserted into the `books` table.

In [0]:
MERGE INTO books b
USING books_updates u
ON b.book_id = u.book_id AND b.title = u.title
WHEN NOT MATCHED AND u.category = 'Computer Science' THEN
  INSERT *

3 new records have been inserted, as expected. Running the same statement again should insert no records, because the source rows now match existing rows in the target:

In [0]:
MERGE INTO books b
USING books_updates u
ON b.book_id = u.book_id AND b.title = u.title
WHEN NOT MATCHED AND u.category = 'Computer Science' THEN
  INSERT *