# Auto Loader

Here is the schema of the bookstore dataset used in this notebook:

![bookstore dataset schema](../Includes/images/image1.png)

In [0]:
%run ../Includes/Copy-Datasets

## Exploring The Source Directory

In this demo, new data will be ingested from orders received in Parquet files.

In [0]:
files = dbutils.fs.ls(f"{dataset_bookstore}/orders-raw")
display(files)

There is only one Parquet file in this directory.
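
To check this from code rather than by reading the listing, the entries can be filtered by extension. This is a minimal sketch: `parquet_files` is a hypothetical helper, and it assumes each entry exposes `name` and `size` attributes, as the objects returned by `dbutils.fs.ls` do.

```python
from typing import Iterable, List


def parquet_files(entries: Iterable) -> List:
    """Keep only the entries whose file name ends with .parquet."""
    return [e for e in entries if e.name.lower().endswith(".parquet")]


# In the notebook (assumes `dbutils` and `dataset_bookstore` are defined):
# orders = parquet_files(dbutils.fs.ls(f"{dataset_bookstore}/orders-raw"))
# print(len(orders), sum(f.size for f in orders))
```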

## Auto Loader

Auto Loader is used to read the current file in this directory, detect new files as they arrive, and ingest them into a target table.

To work with Auto Loader, the `readStream` and `writeStream` methods of the Spark Structured Streaming API are used.

`readStream` parameters:
* Format is `cloudFiles`, indicating that this is an Auto Loader stream
* `cloudFiles.format`: the format of the source data files, here Parquet
* `cloudFiles.schemaLocation`: a directory in which Auto Loader stores the inferred schema
* `.load()`: location of the data source files

`writeStream` parameters:
* `checkpointLocation`: allows Auto Loader to track the ingestion process
* `.table()`: write data into a target table

The same directory is used for storing both the schema and checkpoints.

In [0]:
(spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", "dbfs:/mnt/demo/orders_checkpoint")
        .load(f"{dataset_bookstore}/orders-raw")
      .writeStream
        .option("checkpointLocation", "dbfs:/mnt/demo/orders_checkpoint")
        .table("orders_updates")
)

Auto Loader runs as a streaming query because it uses Spark Structured Streaming to load data incrementally. This query stays continuously active: as soon as new data arrives in the source directory, it is processed and loaded into the target table.
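
The cell above discards the object returned by `.table()`. The following is a minimal sketch of the same pipeline wrapped in a function that returns that handle, so the query can be inspected or stopped later. `start_orders_stream` and its parameter names are hypothetical; the options are the ones used above.

```python
def start_orders_stream(spark, source_dir, checkpoint_dir, target_table):
    """Start an Auto Loader stream and return the handle of the running query."""
    reader = (
        spark.readStream
        .format("cloudFiles")                                   # Auto Loader source
        .option("cloudFiles.format", "parquet")                 # format of the incoming files
        .option("cloudFiles.schemaLocation", checkpoint_dir)    # where the inferred schema is stored
        .load(source_dir)
    )
    return (
        reader.writeStream
        .option("checkpointLocation", checkpoint_dir)           # tracks ingestion progress
        .table(target_table)                                    # same call as in the cell above
    )
```

In the notebook this might be called as `query = start_orders_stream(spark, f"{dataset_bookstore}/orders-raw", "dbfs:/mnt/demo/orders_checkpoint", "orders_updates")`, after which `query.isActive` and `query.status` report on the running stream.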

Once the data has been ingested into Delta Lake by Auto Loader, it can be queried the same way as any other table.

In [0]:
%sql
SELECT * FROM orders_updates

In [0]:
%sql
SELECT COUNT(*) FROM orders_updates

There are 1000 records.

## Landing New Files

Copying new files into the directory:

In [0]:
# Helper function that comes with the bookstore dataset
load_new_data()

A new file has landed in the source directory.

Let's do it again:

In [0]:
load_new_data()

In [0]:
files = dbutils.fs.ls(f"{dataset_bookstore}/orders-raw")
display(files)

Two additional files have been loaded into the directory. Because the streaming query is still active, it has processed these new files:

![](Screenshot 2025-05-28 151500.png)

Auto Loader has detected the two new files in the directory and has processed them.
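
One way to confirm this from code is to sum the input rows reported by the streaming query. This is a sketch under assumptions: `rows_ingested` is a hypothetical helper, and it expects progress reports shaped like the dictionaries in a query's `recentProgress`, each carrying a `numInputRows` entry. Only the most recent reports are retained, so the total can undercount on a long-running stream.

```python
def rows_ingested(progress_reports):
    """Sum numInputRows over a list of streaming progress reports."""
    return sum(report["numInputRows"] for report in progress_reports)


# In the notebook (assumes at least one streaming query is running):
# for query in spark.streams.active:
#     print(query.id, rows_ingested(query.recentProgress))
```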

In [0]:
%sql
SELECT COUNT(*) FROM orders_updates

2000 records have been added to the 1000 that were already in the table.

## Exploring Table History

In [0]:
%sql
DESCRIBE HISTORY orders_updates

There are three streaming updates.
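
The same count can be derived from the history rows in Python. This is a minimal sketch: `streaming_update_versions` is a hypothetical helper that expects each row as a dictionary with `version` and `operation` keys, and it assumes streaming writes are recorded with the operation name `STREAMING UPDATE`.

```python
def streaming_update_versions(history_rows):
    """Return the table versions written by streaming updates, oldest first."""
    return sorted(
        row["version"]
        for row in history_rows
        if row["operation"] == "STREAMING UPDATE"
    )


# In the notebook:
# history = spark.sql("DESCRIBE HISTORY orders_updates").collect()
# streaming_update_versions([row.asDict() for row in history])
```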

## Cleaning Up
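
The streaming query started earlier is still running at this point, so it is probably worth stopping it before dropping the table and removing the checkpoint. A minimal sketch, assuming the stream was started on the notebook's `spark` session; `stop_active_streams` is a hypothetical helper name.

```python
def stop_active_streams(spark):
    """Stop every active streaming query and return the ids of those stopped."""
    stopped = []
    for query in spark.streams.active:
        query.stop()              # ask the query to stop
        stopped.append(query.id)
    return stopped


# In the notebook:
# stop_active_streams(spark)
```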

In [0]:
%sql
DROP TABLE orders_updates

Removing the checkpoint location:

In [0]:
dbutils.fs.rm("dbfs:/mnt/demo/orders_checkpoint", True)