
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# 4 - Handling CSV Ingestion with the Rescued Data Column

In this demonstration, we will ingest CSV files into Delta Lake using the `CTAS` (`CREATE TABLE AS SELECT`) pattern with the `read_files()` function and explore the rescued data column.

### Learning Objectives

By the end of this lesson, you will be able to:

- Ingest CSV files as Delta tables using the `CREATE TABLE AS SELECT` (CTAS) statement with the `read_files()` function.
- Define and apply an explicit schema with `read_files()` to ensure consistent and reliable data ingestion.
- Handle and inspect rescued data that does not conform to the defined schema.

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default and you have a Shared SQL warehouse.

<!-- ![Select Cluster](./Includes/images/selecting_cluster_info.png) -->

Follow these steps to select the classic compute cluster:


1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.

   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.


## A. Classroom Setup

Run the following cell to configure your working environment for this notebook.

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically reference the information needed to run the course in the lab environment.

In [0]:
%run ./Includes/Classroom-Setup-04

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


----------------------------------------------------------------------------------------
Directory /Volumes/dbacademy/ops/labuser10983516_1758894578@vocareum_com/csv_demo_files already exists. No action taken.
Directory /Volumes/dbacademy/ops/labuser10983516_1758894578@vocareum_com/json_demo_files already exists. No action taken.
Directory /Volumes/dbacademy/ops/labuser10983516_1758894578@vocareum_com/xml_demo_files already exists. No action taken.
----------------------------------------------------------------------------------------



Run the cell below to view your default catalog and schema. Notice that your default catalog is **dbacademy** and your default schema is your unique **labuser** schema.

**NOTE:** The default catalog and schema are pre-configured for you to avoid the need to specify the three-level name when writing your tables (i.e., catalog.schema.table).

In [0]:
SELECT current_catalog(), current_schema()

current_catalog(),current_schema()
dbacademy,labuser10983516_1758894578


## B. Overview of CTAS with `read_files()` for Ingestion of CSV Files

CSV (Comma-Separated Values) files are a simple text-based format for storing data, where each line represents a row and values are separated by commas.

In this demonstration, we will use CSV files imported from cloud storage. Let’s explore how to ingest these raw CSV files to Delta Lake.

### B1. Inspecting CSV files

1. List the available CSV files in the `/Volumes/dbacademy_ecommerce/v01/raw/sales-csv` directory. Confirm that 4 CSV files exist in the volume.

In [0]:
LIST '/Volumes/dbacademy_ecommerce/v01/raw/sales-csv'

path,name,size,modification_time
/Volumes/dbacademy_ecommerce/v01/raw/sales-csv/000.csv,000.csv,719280,1726173041000
/Volumes/dbacademy_ecommerce/v01/raw/sales-csv/001.csv,001.csv,672376,1726173041000
/Volumes/dbacademy_ecommerce/v01/raw/sales-csv/002.csv,002.csv,621629,1726173041000
/Volumes/dbacademy_ecommerce/v01/raw/sales-csv/003.csv,003.csv,392295,1726173042000


2. Query the CSV files directly by path in the `/Volumes/dbacademy_ecommerce/v01/raw/sales-csv` volume directory and view the results. Notice the following:

   - The data files include a header row containing the column names.

   - The columns are delimited by the pipe character (`|`). 

     For example, the first row reads:  
     ```order_id|email|transactions_timestamp|total_item_quantity|purchase_revenue_in_usd|unique_items|items```

     The pipe (`|`) indicates column separation, meaning there are seven columns:  
     **order_id**, **email**, **transactions_timestamp**, **total_item_quantity**, **purchase_revenue_in_usd**, **unique_items**, and **items**.

In [0]:
SELECT * 
FROM csv.`/Volumes/dbacademy_ecommerce/v01/raw/sales-csv`
LIMIT 5;

_c0
order_id|email|transactions_timestamp|total_item_quantity|purchase_revenue_in_usd|unique_items|items
298592|sandovalaustin@holder.com|1592629288475307|1|850.5|1|[{'coupon': 'NEWBED10'
299024|msmith@monroe.com|1592636869915092|2|1092.6|2|[{'coupon': 'NEWBED10'
300048|robertstimothy@hotmail.com|1592649862529478|1|1075.5|1|[{'coupon': 'NEWBED10'
298711|lovejamie@yahoo.com|1592631406799948|1|850.5|1|[{'coupon': 'NEWBED10'


3. Run the cell below to query the CSV files using the default options in the `read_files` function.

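   Review the results. Notice that the CSV files were **not** parsed correctly in the table output: the default delimiter is a comma, so the pipe-delimited fields were not split into separate columns.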

   To fix this, we’ll need to provide additional options to the `read_files()` function for proper ingestion of CSV files.

In [0]:
SELECT * 
FROM read_files(
        "/Volumes/dbacademy_ecommerce/v01/raw/sales-csv",
        format => "csv"
      )
LIMIT 5;

order_id|email|transactions_timestamp|total_item_quantity|purchase_revenue_in_usd|unique_items|items,_rescued_data
298592|sandovalaustin@holder.com|1592629288475307|1|850.5|1|[{'coupon': 'NEWBED10',
299024|msmith@monroe.com|1592636869915092|2|1092.6|2|[{'coupon': 'NEWBED10',
300048|robertstimothy@hotmail.com|1592649862529478|1|1075.5|1|[{'coupon': 'NEWBED10',
298711|lovejamie@yahoo.com|1592631406799948|1|850.5|1|[{'coupon': 'NEWBED10',
301760|jennifer7054@gmail.com|1592661071882666|1|940.5|1|[{'coupon': 'NEWBED10',
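
The effect of the delimiter can be reproduced outside of Databricks. The sketch below uses Python's standard `csv` module (not `read_files()`) on one sample row modeled on the sales files; the `items` value is shortened for readability and the `parse` helper is a hypothetical name. It only illustrates why a comma default cuts the row at the first comma inside **items**, while a pipe delimiter yields the seven expected fields.

```python
import csv
import io

# Sample row modeled on the sales files: pipe-delimited, with a comma inside
# the last field. The items value is shortened for this illustration.
line = (
    "298592|sandovalaustin@holder.com|1592629288475307|1|850.5|1|"
    "[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F'}]"
)


def parse(text, delimiter):
    """Split one line of delimited text into fields using the given delimiter."""
    return next(csv.reader(io.StringIO(text), delimiter=delimiter))


comma_fields = parse(line, ",")  # comma is the usual CSV default
pipe_fields = parse(line, "|")   # the delimiter these files actually use

# With a comma, the row is cut at the comma inside the items value (2 fields).
print(len(comma_fields), comma_fields[0])

# With a pipe, the row splits into the seven expected fields.
print(len(pipe_fields), pipe_fields)
```

The first field of the comma-based parse ends at `'NEWBED10'`, which matches the truncated values shown in the output above.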


### B2. Using CSV Options with `read_files()`

1. The code in the next cell ingests the CSV files using the `read_files()` function with some additional options.

   In this example, we are using the following [CSV options](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files#csv-options) with the `read_files()` function:

   - The first argument specifies the path to the CSV files.

   - `format => "csv"` — Indicates that the files are in CSV format.

   - `sep => "|"` — Specifies that columns are delimited by the pipe (`|`) character.

   - `header => true` — Tells the reader to use the first row as column headers.
   
   - Although we're using CSV files in this demonstration, other file types (like JSON or Parquet) can also be used by specifying different options.

   Run the cell and view the results. Notice the CSV files were read correctly, and a new column named **_rescued_data** appeared at the end of the result table.

**NOTE:** A **_rescued_data** column is automatically included to capture any data that doesn't match the inferred or provided schema.

In [0]:
SELECT * 
FROM read_files(
        "/Volumes/dbacademy_ecommerce/v01/raw/sales-csv",
        format => "csv",
        sep => "|",
        header => true
      )
LIMIT 5;

order_id,email,transactions_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,items,_rescued_data
298592,sandovalaustin@holder.com,1592629288475307,1,850.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F', 'item_name': 'Standard Full Mattress', 'item_revenue_in_usd': 850.5, 'price_in_usd': 945.0, 'quantity': 1}]",
299024,msmith@monroe.com,1592636869915092,2,1092.6,2,"[{'coupon': 'NEWBED10', 'item_id': 'M_PREM_T', 'item_name': 'Premium Twin Mattress', 'item_revenue_in_usd': 985.5, 'price_in_usd': 1095.0, 'quantity': 1}, {'coupon': 'NEWBED10', 'item_id': 'P_DOWN_S', 'item_name': 'Standard Down Pillow', 'item_revenue_in_usd': 107.10000000000001, 'price_in_usd': 119.0, 'quantity': 1}]",
300048,robertstimothy@hotmail.com,1592649862529478,1,1075.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_K', 'item_name': 'Standard King Mattress', 'item_revenue_in_usd': 1075.5, 'price_in_usd': 1195.0, 'quantity': 1}]",
298711,lovejamie@yahoo.com,1592631406799948,1,850.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F', 'item_name': 'Standard Full Mattress', 'item_revenue_in_usd': 850.5, 'price_in_usd': 945.0, 'quantity': 1}]",
301760,jennifer7054@gmail.com,1592661071882666,1,940.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_Q', 'item_name': 'Standard Queen Mattress', 'item_revenue_in_usd': 940.5, 'price_in_usd': 1045.0, 'quantity': 1}]",


2. Now that we’ve successfully queried the CSV files using `read_files()`, let’s use a CTAS (`CREATE TABLE AS SELECT`) statement with the same query to complete the following:
    - Create a Delta table named **sales_bronze**. 
    - Add an ingestion timestamp and ingestion metadata columns to our **sales_bronze** table.

        - **Ingestion Timestamp:** To record when the data was ingested, use the [`current_timestamp()`](https://docs.databricks.com/aws/en/sql/language-manual/functions/current_timestamp) function. It returns the current timestamp at the start of query execution and is useful for tracking ingestion time.

        - **Metadata Columns:** To include file metadata, use the [`_metadata`](https://docs.databricks.com/en/ingestion/file-metadata-column.html) column, which is available for all input file formats. This hidden column allows access to various metadata attributes from the input files.
            - Use `_metadata.file_modification_time` to capture the last modification time of the input file.
            - Use `_metadata.file_name` to capture the name of the input file.
            - [File metadata column](https://docs.databricks.com/gcp/en/ingestion/file-metadata-column)




    Run the cell and review the results. You should see that the **sales_bronze** table was created successfully with the CSV data and additional metadata columns.

In [0]:
-- Drop the table if it exists for demonstration purposes
DROP TABLE IF EXISTS sales_bronze;


-- Create the Delta table
CREATE TABLE sales_bronze AS
SELECT 
  *,
  _metadata.file_modification_time AS file_modification_time,
  _metadata.file_name AS source_file, 
  current_timestamp() AS ingestion_time
FROM read_files(
        "/Volumes/dbacademy_ecommerce/v01/raw/sales-csv",
        format => "csv",
        sep => "|",
        header => true
      );


-- Display the table
SELECT *
FROM sales_bronze

order_id,email,transactions_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,items,_rescued_data,file_modification_time,source_file,ingestion_time
298592,sandovalaustin@holder.com,1592629288475307,1,850.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F', 'item_name': 'Standard Full Mattress', 'item_revenue_in_usd': 850.5, 'price_in_usd': 945.0, 'quantity': 1}]",,2024-09-12T20:30:41Z,000.csv,2025-09-26T16:24:26.521416Z
299024,msmith@monroe.com,1592636869915092,2,1092.6,2,"[{'coupon': 'NEWBED10', 'item_id': 'M_PREM_T', 'item_name': 'Premium Twin Mattress', 'item_revenue_in_usd': 985.5, 'price_in_usd': 1095.0, 'quantity': 1}, {'coupon': 'NEWBED10', 'item_id': 'P_DOWN_S', 'item_name': 'Standard Down Pillow', 'item_revenue_in_usd': 107.10000000000001, 'price_in_usd': 119.0, 'quantity': 1}]",,2024-09-12T20:30:41Z,000.csv,2025-09-26T16:24:26.521416Z
300048,robertstimothy@hotmail.com,1592649862529478,1,1075.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_K', 'item_name': 'Standard King Mattress', 'item_revenue_in_usd': 1075.5, 'price_in_usd': 1195.0, 'quantity': 1}]",,2024-09-12T20:30:41Z,000.csv,2025-09-26T16:24:26.521416Z
298711,lovejamie@yahoo.com,1592631406799948,1,850.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F', 'item_name': 'Standard Full Mattress', 'item_revenue_in_usd': 850.5, 'price_in_usd': 945.0, 'quantity': 1}]",,2024-09-12T20:30:41Z,000.csv,2025-09-26T16:24:26.521416Z
301760,jennifer7054@gmail.com,1592661071882666,1,940.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_Q', 'item_name': 'Standard Queen Mattress', 'item_revenue_in_usd': 940.5, 'price_in_usd': 1045.0, 'quantity': 1}]",,2024-09-12T20:30:41Z,000.csv,2025-09-26T16:24:26.521416Z
302809,ywhite@kane.org,1592665563660982,1,1075.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_K', 'item_name': 'Standard King Mattress', 'item_revenue_in_usd': 1075.5, 'price_in_usd': 1195.0, 'quantity': 1}]",,2024-09-12T20:30:41Z,000.csv,2025-09-26T16:24:26.521416Z
309136,karen61@hotmail.com,1592689638083947,1,1795.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_PREM_K', 'item_name': 'Premium King Mattress', 'item_revenue_in_usd': 1795.5, 'price_in_usd': 1995.0, 'quantity': 1}]",,2024-09-12T20:30:41Z,000.csv,2025-09-26T16:24:26.521416Z
303941,deborah18@conrad-gallagher.com,1592669885794924,1,850.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F', 'item_name': 'Standard Full Mattress', 'item_revenue_in_usd': 850.5, 'price_in_usd': 945.0, 'quantity': 1}]",,2024-09-12T20:30:41Z,000.csv,2025-09-26T16:24:26.521416Z
305920,khanedwin@gmail.com,1592676863608194,1,1075.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_K', 'item_name': 'Standard King Mattress', 'item_revenue_in_usd': 1075.5, 'price_in_usd': 1195.0, 'quantity': 1}]",,2024-09-12T20:30:41Z,000.csv,2025-09-26T16:24:26.521416Z
298795,samantha4354@hotmail.com,1592632916516773,1,985.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_PREM_T', 'item_name': 'Premium Twin Mattress', 'item_revenue_in_usd': 985.5, 'price_in_usd': 1095.0, 'quantity': 1}]",,2024-09-12T20:30:41Z,000.csv,2025-09-26T16:24:26.521416Z


3. View the column data types of the **sales_bronze** table. Notice that the `read_files()` function automatically infers the schema if one is not explicitly provided.

      **NOTE:** When the schema is not provided, `read_files()` attempts to infer a unified schema across the discovered files, which requires reading all the files unless a `LIMIT` clause is used. Even with a `LIMIT`, more files than strictly required might be read in order to return a more representative schema of the data.

     - [Schema inference](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files#csv-options)

In [0]:
DESCRIBE TABLE EXTENDED sales_bronze;
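
To make the three extra columns added by the CTAS statement in step 2 more concrete, here is a plain-Python model of what they record. It is only a sketch under stated assumptions: it uses the standard library rather than Spark, `ingest_with_metadata` is a hypothetical helper, and the two small files are made up for the example. It does not describe how Databricks implements `_metadata`.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone


def ingest_with_metadata(directory, sep="|"):
    """Read every CSV file in `directory` and tag each row with file metadata."""
    # One timestamp for the whole run, like current_timestamp() in a single query.
    ingestion_time = datetime.now(timezone.utc)
    rows = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
        with open(path, newline="") as handle:
            for record in csv.DictReader(handle, delimiter=sep):
                record["file_modification_time"] = modified  # per source file
                record["source_file"] = name                 # per source file
                record["ingestion_time"] = ingestion_time    # same for all rows
                rows.append(record)
    return rows


# Hypothetical two-file example written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, order_id in [("000.csv", "1"), ("001.csv", "2")]:
        with open(os.path.join(tmp, name), "w", newline="") as handle:
            handle.write(f"order_id|email\n{order_id}|user{order_id}@example.com\n")
    rows = ingest_with_metadata(tmp)

print([(row["order_id"], row["source_file"]) for row in rows])
```

The file name and modification time vary by source file, while the ingestion time is identical on every row of one run, which is what makes it useful for tracking a load.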

### B3. (BONUS) Python Equivalent

In [0]:
%python

df = (spark
      .read 
      .option("header", True) 
      .option("sep","|") 
      .option("rescuedDataColumn", "_rescued_data")       # <--------- Add the rescued data column
      .csv("/Volumes/dbacademy_ecommerce/v01/raw/sales-csv")
    )

df.display()

## C. Troubleshooting Common CSV Issues


1. To begin, let’s quickly explore your data source raw files volume. Complete the following steps to view your course volume in **dbacademy.ops.labuser**:

   a. In the left navigation bar, select the catalog icon ![Catalog Icon](./Includes/images/catalog_icon.png).

   b. Expand the **dbacademy** catalog.

   c. Expand the **ops** schema.

   d. Expand **Volumes**. You should see a volume with your **labuser** name, which contains the source data to ingest.

   e. Expand your **labuser** volume. Notice that this volume contains a series of subdirectories. We will be using the **csv_demo_files** directory in your volume.

   f. Expand the **csv_demo_files** subdirectory. Notice that it contains the files:
      - **malformed_example_1_data.csv**
      - **malformed_example_2_data.csv**

2. In the cell below, view the value of the SQL variable `DA.paths_working_dir`. This variable will reference the path to your **labuser** volume from above, as each user has a different source volume. This variable is created within the classroom setup script to dynamically reference your unique volume.

   Run the cell and review the results. You’ll notice that the `DA.paths_working_dir` variable points to your `/Volumes/dbacademy/ops/labuser` volume.

**NOTE:** Instead of using the `DA.paths_working_dir` variable, you could also specify the path name directly by right-clicking on your volume and selecting **Copy volume path**.

In [0]:
values(DA.paths_working_dir)

3. You can concatenate the `DA.paths_working_dir` SQL variable with a string to specify a specific subdirectory in your specific volume.

   Run the cell below and review the results. You’ll notice that it returns the path to your **malformed_example_1_data.csv** file. This method will be used when referencing your volume within the `read_files` function.

In [0]:
values(DA.paths_working_dir || '/csv_demo_files/malformed_example_1_data.csv')

### C1. Defining a Schema During Ingestion

We want to read the CSV file into the bronze table using a defined schema.

**Benefits of explicit schemas:**
- Reduce the risk of inferred schema inconsistencies, especially with text-based formats like JSON or CSV.
- Enable faster parsing and loading of data, as Spark can immediately apply the correct types and structure without inferring the schema.
- Improve performance with large datasets by significantly reducing compute overhead.


1. The query below will reference the **malformed_example_1_data.csv** file. This will allow you to view the CSV file as text for inspection.

   Run the query and review the results. Notice the following:

   - The CSV file is **|** delimited.

   - The CSV file contains headers.
   
**NOTE:** The **transactions_timestamp** column contains the string *aaa* in the first row, which will cause issues during ingestion when attempting to read the **transactions_timestamp** column as a BIGINT.

In [0]:
%python
spark.sql(f'''
    SELECT *
    FROM text.`{DA.paths.working_dir}/csv_demo_files/malformed_example_1_data.csv`
''').display()

2. Use the `read_files` function to see how this CSV file is read into the table. Run the cell and view the results. 

    **IMPORTANT:** Notice that the malformed value *aaa* in the **transactions_timestamp** column causes the entire column to be inferred as a STRING. However, we want the **transactions_timestamp** column to be read into the bronze table as a BIGINT.


In [0]:
SELECT *
FROM read_files(
        DA.paths_working_dir || '/csv_demo_files/malformed_example_1_data.csv',
        format => "csv",
        sep => "|",
        header => true
      );
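
The fallback to STRING can be pictured with a deliberately simplified model of type inference. The sketch below is plain Python, not Spark's actual inference logic, and `infer_column_type` is a hypothetical name. It assumes a column gets a numeric type only when every value parses as that type. The value *aaa* comes from the file above; the other timestamps are borrowed from the sales data for illustration.

```python
def infer_column_type(values):
    """Pick the narrowest type that every value in the column can be parsed as."""

    def all_parse(cast):
        # True only if every value can be converted with `cast`.
        for value in values:
            try:
                cast(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "BIGINT"
    if all_parse(float):
        return "DOUBLE"
    return "STRING"  # fallback that can hold any value


clean = ["1592629288475307", "1592636869915092"]
malformed = ["aaa", "1592636869915092"]  # one bad value in the column

print(infer_column_type(clean))      # BIGINT
print(infer_column_type(malformed))  # STRING
```

A single malformed value is enough to widen the whole column, which is the reason for supplying an explicit schema in the next step.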

3. You can define a schema for the `read_files()` function to read in the data with a specific structure.

   a. Use the `schema` option to define the schema. In this case, we'll read in the following:
   - **order_id** as INT  
   - **email** as STRING  
   - **transactions_timestamp** as BIGINT

   b. Use the `rescuedDataColumn` option to collect all data that can’t be parsed due to data type mismatches or schema mismatches into a separate column for review.

   Run the cell and review the results. Notice that the value *aaa* in the first row could not be parsed as a BIGINT, so **transactions_timestamp** is NULL for that row and the original value was placed in the **_rescued_data** column. Keeping values that don’t conform to the schema allows you to inspect and process them as needed.

**NOTE:** Defining a schema when using `read_files` in Databricks improves performance by skipping the expensive schema inference step and ensures consistent, reliable data parsing. It's especially beneficial for large or semi-structured datasets.

In [0]:
SELECT *
FROM read_files(
        DA.paths_working_dir || '/csv_demo_files/malformed_example_1_data.csv',
        format => "csv",
        sep => "|",
        header => true,
        schema => '''
            order_id INT, 
            email STRING, 
            transactions_timestamp BIGINT''', 
        rescuedDataColumn => '_rescued_data'    -- Create the _rescued_data column
      );

#### Summary: Rescued Data Column

The rescued data column ensures that data that doesn’t match the schema is rescued instead of being dropped. It contains any data that isn’t parsed for one of the following reasons:
- The column is missing from the schema.
- The value's type does not match the column's type in the schema (type mismatch).
- The column name's case differs from the name in the schema (case mismatch).
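
The three cases above can be modeled in a few lines of plain Python. This is a minimal sketch, not Databricks' implementation: `SCHEMA` and `parse_with_rescue` are hypothetical names, the sample rows are made up, and the model omits the `_file_path` entry that the real rescued data also carries.

```python
import json

# Target schema from the example above, expressed as Python casts.
SCHEMA = {"order_id": int, "email": str, "transactions_timestamp": int}


def parse_with_rescue(header, fields, schema=SCHEMA):
    """Cast one delimited row to `schema` and collect whatever does not fit."""
    row = {name: None for name in schema}
    rescued = {}
    for name, raw in zip(header, fields):
        if name not in schema:
            # Column missing from the schema, or named with a different case.
            rescued[name] = raw
            continue
        try:
            row[name] = schema[name](raw)
        except ValueError:
            # Type mismatch: keep the raw value and leave the column empty.
            rescued[name] = raw
    row["_rescued_data"] = json.dumps(rescued) if rescued else None
    return row


header = ["order_id", "email", "transactions_timestamp"]

type_mismatch = parse_with_rescue(header, ["1", "a@example.com", "aaa"])
extra_column = parse_with_rescue(header + ["items"], ["2", "b@example.com", "1592636869915092", "[]"])
case_mismatch = parse_with_rescue(["order_id", "Email", "transactions_timestamp"], ["3", "c@example.com", "1592649862529478"])

for result in (type_mismatch, extra_column, case_mismatch):
    print(result)
```

In each case the row is kept, the unparsed value is preserved as a JSON string, and the typed column is left empty.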

### C2. Handling Missing Headers During Ingestion

In this example, the header row of the CSV file is missing a column name by mistake.

1. Run the cell below to view the **malformed_example_2_data.csv** file. Notice that the first row contains headers, but the first column name is missing.

In [0]:
%python
spark.sql(f'''
  SELECT *
  FROM text.`{DA.paths.working_dir}/csv_demo_files/malformed_example_2_data.csv`
''').display()

2. Let's try to create a table using the **malformed_example_2_data.csv** file with the `read_files()` function. Run the cell and review the results.

    Notice the following:
    - The first column of the CSV file was not read into the table as a standard column and was instead placed in the **_rescued_data** column.

    - The **_rescued_data** column stores the rescued data as a JSON-formatted string.
    
    - When inspecting the **_rescued_data** JSON-formatted string, you'll see that the unnamed column from the CSV file is represented with a key of **_c0**, which contains the column value as a string, along with a **_file_path** key.

In [0]:
-- Drop the table if it exists for demonstration purposes
DROP TABLE IF EXISTS demo_4_example_2_bronze;

-- Create Delta table by ingesting CSV file
CREATE OR REPLACE TABLE demo_4_example_2_bronze AS
SELECT *
FROM read_files(
        DA.paths_working_dir || '/csv_demo_files/malformed_example_2_data.csv',
        format => "csv",
        sep => "|",
        header => true
      );


-- Display the table
SELECT *
FROM demo_4_example_2_bronze;

3. The **_rescued_data** column is a JSON-formatted string. We won’t go into detail on how to handle this type of data here, as it will be covered in a later demo and lab.

    However, it's important to note that you can extract values from the **_rescued_data** column and add them to your bronze table. To obtain the value from the **_c0** field, you can use the `_rescued_data:_c0` syntax, as shown in the next cell.

    **NOTE:** The next cell returns the rescued value as a new **order_id** column.

In [0]:
SELECT
  cast(_rescued_data:_c0 AS BIGINT) AS order_id,
  *
FROM read_files(
        DA.paths_working_dir || '/csv_demo_files/malformed_example_2_data.csv',
        format => "csv",
        sep => "|",
        header => true
      )
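
For readers who want to see what the `_rescued_data:_c0` extraction amounts to, the sketch below does the same thing to a single value in plain Python. The rescued string is a hypothetical example shaped like the one described above (a **_c0** key plus a **_file_path** key); the order ID, the path, and the `rescued_field` helper are all made up for illustration.

```python
import json

# Hypothetical rescued value: the unnamed first column under "_c0",
# plus the path of the source file.
rescued = '{"_c0": "298592", "_file_path": "/Volumes/example/malformed_example_2_data.csv"}'


def rescued_field(rescued_json, key, cast=str):
    """Return one field of a rescued-data JSON string, cast to the requested type."""
    if rescued_json is None:  # nothing was rescued for this row
        return None
    value = json.loads(rescued_json).get(key)
    return None if value is None else cast(value)


order_id = rescued_field(rescued, "_c0", int)
print(order_id)
```

Rows with nothing rescued, or without the requested key, yield `None` in this sketch rather than raising an error.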


&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="blank">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy" target="blank">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use" target="blank">Terms of Use</a> | 
<a href="https://help.databricks.com/" target="blank">Support</a>