# Data Ingestion: COPY INTO
## Databricks Zero to Hero

**Objective:** Learn how to ingest data into the Data Lakehouse using the SQL `COPY INTO` command. We will explore its idempotency, its support for schema evolution, and simple data transformations during ingestion.

### What is COPY INTO?
`COPY INTO` is a SQL command used to load data from a file location into a Delta table.
*   **Idempotent:** It tracks which files have already been loaded and skips them on later runs, so rerunning the command does **not** duplicate data. Each file is loaded exactly once.
*   **Retriable:** If a job fails, you can safely rerun it.
*   **Versatile:** Supports CSV, JSON, Avro, Parquet, ORC, etc.
*   **Simple:** Best suited for batch loads containing thousands of files. (For millions of files, **Auto Loader** is preferred).

## 1. Environment Setup
We need a location to stage our raw files. We will create a **Managed Volume** and copy some sample data from Databricks datasets.

In [None]:
%sql
-- 1. Create a Managed Volume for landing data
-- (the %sql magic must be the first line of the cell)
CREATE VOLUME IF NOT EXISTS dev.bronze.landing
COMMENT 'This is Landing Managed Volume for Raw Data'

In [None]:
# 2. Prepare the input directory and sample data
# We will copy two CSV files from Databricks datasets to our volume

# Create input folder
dbutils.fs.mkdirs("/Volumes/dev/bronze/landing/input")

# Copy sample Invoice data (Day 1 and Day 2)
source_path = "dbfs:/databricks-datasets/definitive-guide/data/retail-data/by-day/"
target_path = "/Volumes/dev/bronze/landing/input/"

# Copying 2 specific files
dbutils.fs.cp(f"{source_path}2010-12-01.csv", f"{target_path}2010-12-01.csv")
dbutils.fs.cp(f"{source_path}2010-12-02.csv", f"{target_path}2010-12-02.csv")

# Verify files
display(dbutils.fs.ls(target_path))
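
Copying files one at a time gets repetitive as the number of days grows. Here is a minimal sketch, assuming the dataset keeps its one-file-per-day `YYYY-MM-DD.csv` naming. The helper `daily_file_names` is hypothetical (it is not part of Databricks), and the commented loop shows how its output could be passed to `dbutils.fs.cp` inside a notebook.

```python
from datetime import date, timedelta


def daily_file_names(start: date, days: int) -> list[str]:
    """Return file names like '2010-12-01.csv' for `days` consecutive days."""
    return [f"{(start + timedelta(days=i)).isoformat()}.csv" for i in range(days)]


names = daily_file_names(date(2010, 12, 1), 2)
print(names)  # ['2010-12-01.csv', '2010-12-02.csv']

# Inside a Databricks notebook (dbutils is not available elsewhere):
# for name in names:
#     dbutils.fs.cp(f"{source_path}{name}", f"{target_path}{name}")
```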

## 2. Basic Ingestion (Schema Evolution)
We will create a **placeholder table** (an empty Delta table with no schema defined). We will rely on `COPY INTO` to infer the schema from the source files and evolve the target table structure automatically.

In [None]:
%sql
-- Create a placeholder table without schema
DROP TABLE IF EXISTS dev.bronze.invoice_cp;

CREATE TABLE dev.bronze.invoice_cp;

In [None]:
%sql
-- Load data using COPY INTO
-- 'mergeSchema' in COPY_OPTIONS lets the empty table adopt the CSV columns.
-- Without 'inferSchema', every CSV column is loaded as STRING.

COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS (
    'header' = 'true',
    'mergeSchema' = 'true' -- Merge the schema across all input files
)
COPY_OPTIONS (
    'mergeSchema' = 'true' -- Update target table schema
);
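
If the same load has to run for several tables or folders, the statement can be assembled from Python instead of being retyped. The sketch below uses a hypothetical helper, `build_copy_into`, that only formats a string; it does no escaping, so it assumes trusted inputs. In a Databricks notebook the result could be passed to `spark.sql`.

```python
def build_copy_into(table, source, file_format, pattern=None,
                    format_options=None, copy_options=None):
    """Assemble a COPY INTO statement as a string (hypothetical helper)."""

    def render(options):
        # Render {'header': 'true'} as "'header' = 'true'"
        return ", ".join(f"'{k}' = '{v}'" for k, v in options.items())

    lines = [f"COPY INTO {table}", f"FROM '{source}'", f"FILEFORMAT = {file_format}"]
    if pattern:
        lines.append(f"PATTERN = '{pattern}'")
    if format_options:
        lines.append(f"FORMAT_OPTIONS ({render(format_options)})")
    if copy_options:
        lines.append(f"COPY_OPTIONS ({render(copy_options)})")
    return "\n".join(lines)


statement = build_copy_into(
    table="dev.bronze.invoice_cp",
    source="/Volumes/dev/bronze/landing/input",
    file_format="CSV",
    pattern="*.csv",
    format_options={"header": "true", "mergeSchema": "true"},
    copy_options={"mergeSchema": "true"},
)
print(statement)

# Inside a Databricks notebook: spark.sql(statement)
```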

In [None]:
%sql
-- Verify the loaded data
SELECT * FROM dev.bronze.invoice_cp;

## 3. Idempotency Check (Exactly Once)
A key feature of `COPY INTO` is that it keeps state between runs. If we run the exact same command again, it should **skip** the files it has already processed.

In [None]:
%sql
-- Rerunning the exact same command
COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('header' = 'true', 'mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

-- Expected result: num_affected_rows and num_inserted_rows should both be 0

### How does it know?
Databricks keeps an internal log, `_copy_into_log`, inside the table's `_delta_log` directory. This metadata records the names of the processed files and their states.
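
To build intuition for file-level tracking, here is a toy model in plain Python. It is only a sketch of the idea, not how Databricks implements it: the class `ToyCopyInto`, its in-memory set, and the sample rows are all made up for illustration, and the real log format is not modelled.

```python
from fnmatch import fnmatch


class ToyCopyInto:
    """Toy model of file-level load tracking; not Databricks' implementation."""

    def __init__(self):
        self.loaded_files = set()   # stands in for the load-history log
        self.rows = []              # stands in for the target table

    def run(self, source_files, pattern="*.csv"):
        """Load rows from files not seen before; return the number of rows added."""
        added = 0
        for name in sorted(source_files):
            if not fnmatch(name, pattern) or name in self.loaded_files:
                continue  # non-matching or already loaded: skip
            self.rows.extend(source_files[name])
            self.loaded_files.add(name)
            added += len(source_files[name])
        return added


folder = {"2010-12-01.csv": ["a", "b"], "2010-12-02.csv": ["c"]}
table = ToyCopyInto()
print(table.run(folder))  # 3: both files are new
print(table.run(folder))  # 0: nothing new, so the rerun adds no rows

folder["2010-12-03.csv"] = ["d", "e"]
print(table.run(folder))  # 2: only the new file is loaded
```

Because the decision is made per file name, a rerun after a failure is safe in this model: files that were recorded are skipped and the rest are picked up.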

In [None]:
%sql
-- Check table metadata to find physical location
DESCRIBE EXTENDED dev.bronze.invoice_cp;

In [None]:
# Let's inspect the underlying log (Python)
# Replace <your_table_location> with the Location value from the result above
# Example: dbfs:/user/hive/warehouse/dev.db/bronze.db/invoice_cp (your Location value will differ)

# path = "<your_table_location>/_delta_log/_copy_into_log"
# display(dbutils.fs.ls(path))

## 4. Transformations & Custom Schema
Often, you don't want `SELECT *`. You might want to:
1.  Load specific columns.
2.  Change data types (e.g., String to Double).
3.  Add metadata columns (e.g., Ingestion Timestamp).

Let's create a target table with a specific schema and load data with transformations.

In [None]:
%sql
-- Create a table with defined schema
DROP TABLE IF EXISTS dev.bronze.invoice_cp_alt;

CREATE TABLE dev.bronze.invoice_cp_alt (
    InvoiceNo STRING,
    StockCode STRING,
    Quantity DOUBLE,
    _insert_date TIMESTAMP
);

In [None]:
%sql
-- COPY INTO with Transformations using SELECT
COPY INTO dev.bronze.invoice_cp_alt
FROM (
    SELECT 
        InvoiceNo, 
        StockCode, 
        cast(Quantity as DOUBLE) as Quantity, -- Casting type
        current_timestamp() as _insert_date   -- Adding custom column
    FROM '/Volumes/dev/bronze/landing/input'
)
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS (
    'header' = 'true',
    'mergeSchema' = 'true'
);
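
For readers who think in Python, the `SELECT` above does roughly the following to each CSV row. This is a plain-Python analogue for illustration only: the function `transform` and the sample rows are made up, and Spark's own casting rules (for example, how invalid numbers are handled) are not reproduced here.

```python
import csv
import io
from datetime import datetime, timezone


def transform(row, load_time):
    """Keep three columns, cast Quantity to float, add a load timestamp."""
    return {
        "InvoiceNo": row["InvoiceNo"],
        "StockCode": row["StockCode"],
        "Quantity": float(row["Quantity"]),   # CSV values arrive as strings
        "_insert_date": load_time,
    }


# Made-up sample rows for illustration; not taken from the real dataset
sample = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity\n"
    "A1,S1,Sample item,6\n"
    "A2,S2,Another item,2\n"
)
load_time = datetime.now(timezone.utc)
rows = [transform(r, load_time) for r in csv.DictReader(sample)]
print(rows[0]["Quantity"], sorted(rows[0]))
```

Columns that are not selected (here `Description`) never reach the target, which is why the table only needs the four declared columns.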

In [None]:
%sql
-- Verify data (Should have 4 columns and correct types)
SELECT * FROM dev.bronze.invoice_cp_alt;

## 5. Incremental Loading
Let's verify that `COPY INTO` picks up **new** files automatically. We will add a third file to the source folder and rerun the command.

In [None]:
# Copy a new file (Day 3) to the source folder
dbutils.fs.cp(
    "dbfs:/databricks-datasets/definitive-guide/data/retail-data/by-day/2010-12-03.csv", 
    "/Volumes/dev/bronze/landing/input/2010-12-03.csv"
)

In [None]:
%sql
-- Rerun the COPY INTO command
-- It should ONLY load the records from 2010-12-03.csv
COPY INTO dev.bronze.invoice_cp_alt
FROM (
    SELECT 
        InvoiceNo, 
        StockCode, 
        cast(Quantity as DOUBLE) as Quantity, 
        current_timestamp() as _insert_date
    FROM '/Volumes/dev/bronze/landing/input'
)
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('header' = 'true', 'mergeSchema' = 'true');

In [None]:
%sql
-- Check the record count (it should now include the rows from 2010-12-03.csv)
SELECT count(*) FROM dev.bronze.invoice_cp_alt;

## Summary
*   **COPY INTO** is powerful for simple, batch-based data ingestion.
*   It handles **Idempotency** automatically (no duplicates).
*   It supports **Schema Inference** and **Evolution**.
*   It allows basic **SQL Transformations** (Casting, renaming, adding columns) during load.

### Limitation
`COPY INTO` lists all files in the directory to detect new ones. If you have **millions of files**, this listing operation becomes slow. In such cases, Databricks recommends using **Auto Loader**, which we will cover in the next session.
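
A back-of-the-envelope sketch of why listing cost grows: the function below is a simplified model, assuming files are never removed from the folder and each run lists every file present exactly once. The name `listing_work` and the example numbers are illustrative, not measurements.

```python
def listing_work(days, files_per_day):
    """Total paths listed across daily runs when each run lists the whole folder."""
    total_listed = 0
    files_in_folder = 0
    for _ in range(days):
        files_in_folder += files_per_day   # new files land
        total_listed += files_in_folder    # the run lists every file present
    return total_listed


print(listing_work(days=3, files_per_day=1))       # 6  (1 + 2 + 3)
print(listing_work(days=365, files_per_day=1000))  # 66795000
```

In this model a year of daily runs loads 365,000 files but lists almost 67 million paths, so the listing work grows much faster than the data actually ingested.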