## CTAS Statements in Databricks
One of the key features of Delta Lake tables is their flexibility in creation. Besides using standard `CREATE TABLE` statements, Databricks supports **CTAS (Create Table As Select)** statements.

**CTAS** statements allow you to **create and populate a table in a single operation**, based on the results of a `SELECT` query. This approach automatically infers the schema, eliminating the need for manual column definitions.

**Basic Syntax**:

`CREATE TABLE table_2`<br>
`AS SELECT * FROM table_1;`<br>


### Create an initial table with sample data

In [0]:
-- Create a simple table with product sales data
DROP TABLE IF EXISTS sales_data;
CREATE TABLE sales_data (
  id INT,
  product STRING,
  amount DOUBLE,
  sale_date DATE
);

-- Insert sample records
INSERT INTO sales_data VALUES
  (1, 'Laptop', 1200.50, '2024-01-15'),
  (2, 'Monitor', 300.00, '2024-03-22'),
  (3, 'Keyboard', 75.20, '2024-07-05'),
  (4, 'Mouse', 25.00, '2023-12-01');

### Create a new table with CTAS

In [0]:
-- Create a new table populated from sales_data
DROP TABLE IF EXISTS sales_2024;
CREATE TABLE sales_2024
AS
SELECT id, product, amount, sale_date
FROM sales_data
WHERE sale_date >= '2024-01-01';


In [0]:
SELECT * FROM sales_2024

### **Selecting and Renaming Columns**

CTAS statements also let you select specific columns and rename them:

`CREATE TABLE table_2`<br>
`AS SELECT col_1, col_3 AS new_col_3`<br>
`FROM table_1;`<br>

**Create another CTAS with selected & renamed columns**

In [0]:
-- Create another table selecting and renaming columns
DROP TABLE IF EXISTS sales_summary;
CREATE TABLE sales_summary
AS
SELECT
  id,
  product AS product_name,
  amount
FROM sales_data;

In [0]:
SELECT * FROM sales_summary
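
Because CTAS infers the schema from the query, one way to influence the resulting column types is to cast or derive columns in the `SELECT` list. The following is a minimal sketch, assuming the `sales_data` table created earlier; the table name `sales_typed` and the chosen target types are illustrative assumptions, not part of the original notebook.

```sql
-- Hypothetical table name: sales_typed
DROP TABLE IF EXISTS sales_typed;
CREATE TABLE sales_typed
AS
SELECT
  CAST(id AS BIGINT)             AS id,            -- widen INT to BIGINT
  product                        AS product_name,  -- plain rename, type unchanged
  CAST(amount AS DECIMAL(10, 2)) AS amount,        -- DOUBLE to fixed-precision DECIMAL
  year(sale_date)                AS sale_year      -- derived column
FROM sales_data;

-- Inspect the schema that CTAS inferred from the query
DESCRIBE TABLE sales_typed;
```

The new table takes its column names and types from the aliases and expressions in the query, so no separate column list is needed.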

### **Advanced Options**

CTAS supports options to customize table creation, such as adding comments, partitioning, or specifying storage locations.

Example with options:

```
CREATE TABLE new_users
COMMENT "Contains PII"
PARTITIONED BY (city, birth_date)
LOCATION '/some/path'
AS
SELECT id, name, email, birth_date, city
FROM users;
```

**COMMENT**: Adds a description to help users understand the table's purpose (e.g., indicating it contains Personally Identifiable Information).

**PARTITIONED BY**: Organizes data in partitions based on city and birth_date, which can improve performance on large datasets.

**LOCATION**: Specifies the storage path for the table data, which makes the table an external (unmanaged) table.

### **Considerations for Partitioning**

Partitioning can boost performance by reducing the amount of data scanned, provided queries filter on the partition columns.
However, be mindful of the small files problem:

- When partitions are too granular (for example, when partitioning on a high-cardinality column), you can end up with many tiny files.
- This can harm query performance and complicate compaction.

**Create a table with advanced options**

In [0]:
-- Create a partitioned table with comment and custom location
DROP TABLE IF EXISTS sales_partitioned;
CREATE TABLE sales_partitioned
COMMENT "Sales data partitioned by sale_date"
PARTITIONED BY (sale_date)
LOCATION 'dbfs:/table/sales_partitioned'
AS
SELECT * FROM sales_data;


In [0]:
SELECT * FROM sales_partitioned

In [0]:
%fs ls dbfs:/table/sales_partitioned

In [0]:
DESCRIBE EXTENDED sales_partitioned
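
To relate the small files consideration to this table, the sketch below inspects the file layout and then runs a compaction. It assumes the `sales_partitioned` table created above and uses the Delta Lake `DESCRIBE DETAIL` and `OPTIMIZE` commands; with only a handful of sample rows the effect will be negligible, so treat it as an illustration rather than a tuning recipe.

```sql
-- Inspect table-level metadata; the output includes numFiles and sizeInBytes
DESCRIBE DETAIL sales_partitioned;

-- Compact small files into larger ones.
-- Compaction works within each partition, so it does not merge files
-- that belong to different sale_date partitions.
OPTIMIZE sales_partitioned;
```

Since every distinct `sale_date` gets its own partition directory here, a date-partitioned table with few rows per day is a typical case where over-partitioning produces many tiny files.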

### Difference Between CREATE TABLE and CTAS Statements

| Feature                | **CREATE TABLE Statement**                                                                | **CTAS (Create Table As Select) Statement**                                                                      |
| ---------------------- | ------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| **Syntax Example**     |`CREATE TABLE table_2 (col1 INT, col2 STRING, col3 DOUBLE);` | `CREATE TABLE table_2 AS SELECT col1, col2, col3 FROM table_1;`                                    |
| **Schema Declaration** | Supports **manual schema declaration** (you define columns and types explicitly)            | Does **not support manual schema declaration**. The schema is **automatically inferred** from the `SELECT` query |
| **Populating Data**    | Creates an **empty table**. You must use `INSERT INTO` or `COPY INTO` to load data          | Creates the table **with data already populated** from the `SELECT` query                                        |
| **Use Cases**          | When you want to define the schema first and load data later                                | When you want to create and fill a table in a single step                                                        |
| **Transformations**    | No data transformations at creation time                                                    | Supports any transformation the `SELECT` query can express, such as selecting specific columns, renaming them, or filtering rows |


### **Table Constraints**

* After creating a Delta Lake table (using `CREATE TABLE` or CTAS), you can enforce data integrity by adding constraints.
* Databricks enforces two types of constraints:

  * **NOT NULL constraints**
  * **CHECK constraints**

**Adding Constraints**

* Use the `ALTER TABLE` command to define constraints:

  ```
  ALTER TABLE <table_name> ADD CONSTRAINT <constraint_name> <constraint_detail>
  ```
* Before adding constraints, ensure existing data complies with them.
* Once a constraint is in place, any write that violates it fails.
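
One way to check compliance before adding a constraint is to count the rows that would violate it. This is a minimal sketch, assuming the `sales_partitioned` table from earlier and a date-range condition like the one used later in this notebook; the alias `violating_rows` is an illustrative name.

```sql
-- Count rows for which the intended CHECK condition is false.
-- Rows where sale_date is NULL are not counted by this query,
-- because the condition evaluates to NULL for them.
SELECT COUNT(*) AS violating_rows
FROM sales_partitioned
WHERE NOT (sale_date >= '2024-01-01' AND sale_date <= '2024-12-31');
```

A result greater than zero indicates that existing data needs to be cleaned up first.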


**Example: Adding a CHECK Constraint**

* To enforce that dates fall within 2024:

  ```
  ALTER TABLE my_table
  ADD CONSTRAINT valid_date
  CHECK (date >= '2024-01-01' AND date <= '2024-12-31')
  ```
* Here:

  * `valid_date` is the constraint name.
  * The condition requires `date` values to be within the specified range.
* Any inserts or updates with dates outside this range will be rejected, maintaining data consistency.


### Add a CHECK constraint

In [0]:
-- Enforce that sale_date is within 2024
ALTER TABLE sales_partitioned
ADD CONSTRAINT valid_sale_date
CHECK (sale_date >= '2024-01-01' AND sale_date <= '2024-12-31');

**How to Fix It**

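The `ALTER TABLE` statement above fails because `sales_partitioned` still contains a row outside 2024 (the Mouse sale dated 2023-12-01). Before adding a constraint, you must clean up existing invalid data.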

**Option 1**: Delete the problematic rows

**Option 2**: Move invalid data elsewhere

If you want to preserve the invalid rows, you can first insert them into another table and then delete them from this one.
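
Option 2 can itself be expressed as a CTAS statement. The following sketch copies the out-of-range rows into a separate table before they are deleted; the table name `sales_out_of_range` is a hypothetical choice.

```sql
-- Hypothetical archive table name: sales_out_of_range
DROP TABLE IF EXISTS sales_out_of_range;
CREATE TABLE sales_out_of_range
AS
SELECT *
FROM sales_partitioned
WHERE sale_date < '2024-01-01' OR sale_date > '2024-12-31';
```

The `WHERE` clause mirrors the one used in the `DELETE` statement, so the archived rows are the same rows that get removed.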

In [0]:
DELETE FROM sales_partitioned
WHERE sale_date < '2024-01-01' OR sale_date > '2024-12-31';

In [0]:
SELECT * FROM sales_partitioned

**Add the CHECK constraint again**

In [0]:
ALTER TABLE sales_partitioned
ADD CONSTRAINT valid_sale_date
CHECK (sale_date >= '2024-01-01' AND sale_date <= '2024-12-31');
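
To confirm that the constraint was registered, one option is to list the table properties, where Delta Lake records CHECK constraints. A short sketch, assuming the `ALTER TABLE` statement above succeeded:

```sql
-- CHECK constraints appear as properties named delta.constraints.<constraint_name>,
-- so this table should list an entry for valid_sale_date
SHOW TBLPROPERTIES sales_partitioned;
```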

**Test the constraint**

In [0]:
-- Try inserting invalid data to see the constraint in action
INSERT INTO sales_partitioned VALUES
  (5, 'Webcam', 150.00, '2023-11-15');


### Add a NOT NULL constraint
Before adding the constraint, you can check for nulls:

In [0]:
SELECT *
FROM sales_partitioned
WHERE product IS NULL;


If there are nulls, you must either delete those rows or update them with a non-null value.
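
Both clean-up routes could look like the sketch below. The placeholder value `'Unknown'` is an assumption made for illustration; pick whatever default suits your data, or use the delete variant instead.

```sql
-- Option A: replace NULL product names with a placeholder ('Unknown' is an assumed value)
UPDATE sales_partitioned
SET product = 'Unknown'
WHERE product IS NULL;

-- Option B: remove the rows instead (uncomment to use)
-- DELETE FROM sales_partitioned
-- WHERE product IS NULL;
```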

**Add constraint**

In [0]:
ALTER TABLE sales_partitioned
ALTER COLUMN product SET NOT NULL;

In [0]:
-- DESCRIBE TABLE sales_partitioned