
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


## DEMO - Build and Register a RAG Model

A Retrieval-Augmented Generation (RAG) model is a widely used architecture for generative AI applications, particularly when contextual information needs to accompany the prompt. **In this demo, we will construct a RAG pipeline and register it in the Unity Catalog model registry.**

The RAG pipeline will function as a simple **product design chatbot**. As contextual information, we will provide product descriptions that were previously **listed on the Etsy website**. Please note that the dataset **we will use is publicly available, and the large language model (LLM) might already include this data in its training set**. Consequently, the quality of responses with and without contextual data might not differ significantly. In a real-world scenario, the contextual data would include information that is new to the LLM.
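To make the idea of "contextual information accompanying the prompt" concrete, here is a minimal sketch in plain Python. It is not the pipeline built later in this demo: `build_prompt` and the sample product description are hypothetical names and data invented for illustration, and no LLM is called.

```python
def build_prompt(question: str, context_docs: list[str]) -> str:
    """Combine retrieved context and the user question into a single prompt.

    Hypothetical helper for illustration only; the demo itself assembles
    the prompt with LangChain.
    """
    if not context_docs:
        # Without context, the LLM can rely only on its training data.
        return f"Question: {question}\nAnswer:"

    context = "\n\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up product description standing in for a retrieved document.
docs = ["Handmade ceramic mug, 12 oz, dishwasher safe."]
prompt = build_prompt("Suggest a name for a new mug design.", docs)
print(prompt)
```

Comparing the output of `build_prompt(question, docs)` with `build_prompt(question, [])` shows the only difference between a RAG request and a plain LLM request: the retrieved text placed ahead of the question.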

**Learning Objectives:**

By the end of this demo, you will be able to:

- Inspect the Vector Search endpoint and index using the UI.

- Set up a Vector Search index using an existing Delta table.

- Retrieve documents from the vector store using similarity search.

- Assemble a RAG pipeline by integrating various components.

- Register a RAG pipeline in the Model Registry.

## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.
   
   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.

## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need to use one of the following Databricks runtime(s): **16.2.x-cpu-ml-scala2.12**



## Classroom Setup

Install required libraries.

In [0]:
%pip install -qq -U databricks-sdk langchain-databricks databricks-vectorsearch langchain==0.3.7 langchain-community==0.3.7
dbutils.library.restartPython()

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
%run ../Includes/Classroom-Setup-02


The examples and models presented in this course are intended solely for demonstration and educational purposes.
 Please note that the models and prompt examples may sometimes contain offensive, inaccurate, biased, or harmful content.


'labuser10813094_1751481371'

In [0]:
# Assign a Vector Search (VS) endpoint based on the username
vs_endpoint_prefix = "vs_endpoint_"
vs_endpoint_name = vs_endpoint_prefix+str(get_fixed_integer(DA.unique_name("_")))
vs_source_table_name = f"{DA.catalog_name}.{DA.schema_name}.product_text"
vs_index_table_name = f"{DA.catalog_name}.{DA.schema_name}.product_embeddings"

print(f"=== Variables that you will need for this demo === \n")
print(f"Catalog Name                : {DA.catalog_name}\n")
print(f"Schema Name                 : {DA.schema_name}\n")
print(f"Assigned VS endpoint name   : {vs_endpoint_name} \n")
print(f"VS source table name        : {vs_source_table_name} \n")
print(f"VS index table name         : {vs_index_table_name } \n")

=== Variables that you will need for this demo === 

Catalog Name                : dbacademy

Schema Name                 : labuser10813094_1751481371

Assigned VS endpoint name   : vs_endpoint_4 

VS source table name        : dbacademy.labuser10813094_1751481371.product_text 

VS index table name         : dbacademy.labuser10813094_1751481371.product_embeddings 
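The setup cell above maps each user to one of a small number of shared Vector Search endpoints. The course helper `get_fixed_integer` is not shown in this notebook, so the following is only a sketch of how such a deterministic assignment could work, assuming a hash-and-modulo scheme. The function name `assign_endpoint` and the `pool_size` parameter are hypothetical.

```python
import hashlib


def assign_endpoint(username: str,
                    prefix: str = "vs_endpoint_",
                    pool_size: int = 10) -> str:
    """Map a username to one endpoint in a fixed-size pool.

    Hypothetical illustration; this is NOT the course's get_fixed_integer.
    """
    # A cryptographic hash gives a stable value across sessions,
    # unlike Python's built-in hash(), which is salted per process.
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()
    slot = int(digest, 16) % pool_size + 1  # slots are numbered 1..pool_size
    return f"{prefix}{slot}"


endpoint = assign_endpoint("labuser_example")
print(endpoint)
```

The design goal of any such scheme is that the same user always lands on the same endpoint while the class as a whole is spread across the pool.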



## Demo Overview

The initial component of this demonstration is the **retrieval component**. Based on the input query, we will search for and retrieve similar product descriptions.

Next, we will construct the entire pipeline using LangChain. **Please note that LangChain is not within the scope of this course. For more information, we recommend referring to the "Generative AI Engineering with Databricks" course.**
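As a rough illustration of what "search for and retrieve similar" documents means, the sketch below ranks a few toy vectors by cosine similarity to a query vector. It assumes cosine similarity as the metric and uses hand-written three-dimensional vectors. A real Vector Search index uses model-generated embeddings and approximate nearest-neighbor search, so treat this only as a conceptual aid.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings", invented for illustration.
toy_index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(results)
```

Here `doc_a` and `doc_b` point in nearly the same direction as the query and are returned, while the orthogonal `doc_c` is not.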



## Load Dataset

Before you start building the AI chain, you need to load and prepare the dataset and save it as a Delta table.  
For this demo, we will use the **[Databricks Documentation Dataset](/marketplace/consumer/listings/03bbb5c0-983d-4523-833a-57e994d76b3b?o=1120757972560637)** available from the Databricks Marketplace.

This dataset contains documentation pages with associated `id`, `url`, and `content`.  
We will format the data to create a single unified `document` field combining the URL and content, which will then be used to build a Vector Store.

The table will be created for you in the next code block.
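The formatting step described above can be pictured in plain Python before it is run in SQL. This sketch mirrors the `## URL: ... ## Content: ...` layout on a single made-up row. The helper name `to_document` and the sample row are assumptions for illustration.

```python
def to_document(url: str, content: str) -> str:
    """Merge a page URL and its content into one text field."""
    return f"## URL: {url}\n\n## Content: {content}"


# Made-up row with the same columns as the dataset: id, url, content.
rows = [
    {"id": 1, "url": "https://example.com/page", "content": "Example page text."},
]

formatted = [
    {"id": row["id"], "document": to_document(row["url"], row["content"])}
    for row in rows
]
print(formatted[0]["document"])
```

Keeping the URL inside the `document` text means the source page travels with each retrieved chunk, which can help when the chatbot needs to cite where an answer came from.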

In [0]:
DA.validate_table("dbacademy_docs.v01.docs")

# Read dataset from marketplace
dataset = spark.sql(f"""
        SELECT 
            id AS id, 
            CONCAT('## URL: ', url, '\n\n## Content: ', content) AS document 
        FROM dbacademy_docs.v01.docs
    """)

# Create a Delta table for the dataset
vs_source_table_name = f"{DA.catalog_name}.{DA.schema_name}.docs"
dataset.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(vs_source_table_name)

# Enable Change Data Feed for Delta table
spark.sql(f"ALTER TABLE {vs_source_table_name} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

display(spark.sql(f"SELECT * FROM {vs_source_table_name}"))

Validation of table dbacademy_docs.v01.docs complete. No errors found.


id,document
25277,"## URL: https://docs.databricks.com/en/ingestion/bad-records.html ## Content: Handle bad records and files Databricks provides a number of options for dealing with files that contain bad records. Examples of bad data include: Incomplete or corrupt records: Mainly observed in text based file formats like JSON and CSV. For example, a JSON record that doesn’t have a closing brace or a CSV record that doesn’t have as many columns as the header or first record of the CSV file. Mismatched data types: When the value for a column doesn’t have the specified or inferred data type. Bad field names: Can happen in all file formats, when the column name specified in the file or record has a different casing than the specified or inferred schema. Corrupted files: When a file cannot be read, which might be due to metadata or data corruption in binary file types such as Avro, Parquet, and ORC. On rare occasion, might be caused by long-lasting transient failures in the underlying storage system. Missing files: A file that was discovered during query analysis time and no longer exists at processing time. Use badRecordsPath Use badRecordsPath When you set badRecordsPath, the specified path records exceptions for bad records or files encountered during data loading. In addition to corrupt records and files, errors indicating deleted files, network connection exception, IO exception, and so on are ignored and recorded under the badRecordsPath. Note Using the badRecordsPath option in a file-based data source has a few important limitations: It is non-transactional and can lead to inconsistent results. Transient errors are treated as failures. 
Unable to find input file Unable to find input file val df = spark.read .option(""badRecordsPath"", ""/tmp/badRecordsPath"") .format(""parquet"").load(""/input/parquetFile"") // Delete the input parquet file '/input/parquetFile' dbutils.fs.rm(""/input/parquetFile"") df.show() In the above example, since df.show() is unable to find the input file, Spark creates an exception file in JSON format to record the error. For example, /tmp/badRecordsPath/20170724T101153/bad_files/xyz is the path of the exception file. This file is under the specified badRecordsPath directory, /tmp/badRecordsPath. 20170724T101153 is the creation time of this DataFrameReader. bad_files is the exception type. xyz is a file that contains a JSON record, which has the path of the bad file and the exception/reason message. Input file contains bad record Input file contains bad record // Creates a json file containing both parsable and corrupted records Seq(""""""{""a"": 1, ""b"": 2}"""""", """"""{bad-record"""""").toDF().write.format(""text"").save(""/tmp/input/jsonFile"") val df = spark.read .option(""badRecordsPath"", ""/tmp/badRecordsPath"") .schema(""a int, b int"") .format(""json"") .load(""/tmp/input/jsonFile"") df.show() In this example, the DataFrame contains only the first parsable record ({""a"": 1, ""b"": 2}). The second bad record ({bad-record) is recorded in the exception file, which is a JSON file located in /tmp/badRecordsPath/20170724T114715/bad_records/xyz. The exception file contains the bad record, the path of the file containing the record, and the exception/reason message. After you locate the exception files, you can use a JSON reader to process them."
25278,"## URL: https://docs.databricks.com/en/ingestion/copy-into/configure-data-access.html ## Content: Configure data access for ingestion This article describes how admin users can configure access to data in a bucket in Amazon S3 (S3) so that Databricks users can load data from S3 into a table in Databricks. This article describes the following ways to configure secure access to source data: (Recommended) Create a Unity Catalog volume. Create a Unity Catalog external location with a storage credential. Launch a compute resource that uses an AWS instance profile. Generate temporary credentials (an AWS access key ID, a secret key, and a session token). Before you begin Before you begin Before you configure access to data in S3, make sure you have the following: Data in an S3 bucket in your AWS account. To create a bucket, see Creating a bucket in the AWS documentation. To access data using a Unity Catalog volume (recommended), the READ VOLUME privilege on the volume. For more information, see Create and work with volumes and Unity Catalog privileges and securable objects. To access data using a Unity Catalog external location, the READ FILES privilege on the external location. For more information, see Create an external location to connect cloud storage to Databricks. To access data using a compute resource with an AWS instance profile, Databricks workspace admin permissions. A Databricks SQL warehouse. To create a SQL warehouse, see Create a SQL warehouse. Familiarity with the Databricks SQL user interface. Configure access to cloud storage Configure access to cloud storage Use one of the following methods to configure access to S3: (Recommended) Create a Unity Catalog volume. For more information, see Create and work with volumes. Configure a Unity Catalog external location with a storage credential. For more information about external locations, see Create an external location to connect cloud storage to Databricks. 
Configure a compute resource to use an AWS instance profile. For more information, see Configure a SQL warehouse to use an instance profile. Generate temporary credentials (an AWS access key ID, a secret key, and a session token) to share with other Databricks users. For more information, see Generate temporary credentials for ingestion. Clean up Clean up You can clean up the associated resources in your cloud account and Databricks if you no longer want to keep them. Delete the AWS CLI named profile In your ~/.aws/credentials file for Unix, Linux, and macOS, or in your %USERPROFILE%\.aws\credentials file for Windows, remove the following portion of the file, and then save the file: [] aws_access_key_id = aws_secret_access_key = Delete the IAM user Open the IAM console in your AWS account, typically at https://console.aws.amazon.com/iam. In the sidebar, click Users. Select the box next to the user, and then click Delete. Enter the name of the user, and then click Delete. Delete the IAM policy Open the IAM console in your AWS account, if it is not already open, typically at https://console.aws.amazon.com/iam. In the sidebar, click Policies. Select the option next to the policy, and then click Actions > Delete. Enter the name of the policy, and then click Delete. Delete the S3 bucket Open the Amazon S3 console in your AWS account, typically at https://console.aws.amazon.com/s3. Select the option next to the bucket, and then click Empty. Enter permanently delete, and then click Empty. In the sidebar, click Buckets. Select the option next to the bucket, and then click Delete. Enter the name of the bucket, and then click Delete bucket. Stop the SQL warehouse If you are not using the SQL warehouse for any other tasks, you should stop the SQL warehouse to avoid additional costs. In the SQL persona, on the sidebar, click SQL Warehouses. Next to the name of the SQL warehouse, click Stop. When prompted, click Stop again. 
Next steps Next steps After you complete the steps in this article, users can run the COPY INTO command to load the data from the S3 bucket into your Databricks workspace. To load data using a Unity Catalog volume or external location, see Load data using COPY INTO with Unity Catalog volumes or external locations. To load data using a SQL warehouse with an AWS instance profile, see Load data using COPY INTO with an instance profile. To load data using temporary credentials (an AWS access key ID, a secret key, and a session token), see Load data using COPY INTO with temporary credentials."
25279,"## URL: https://docs.databricks.com/en/ingestion/copy-into/examples.html ## Content: Common data loading patterns using COPY INTO Learn common patterns for using COPY INTO to load data from file sources into Delta Lake. There are many options for using COPY INTO. You can also use temporary credentials with COPY INTO in combination with these patterns. See COPY INTO for a full reference of all options. Create target tables for COPY INTO Create target tables for COPY INTO COPY INTO must target an existing Delta table. In Databricks Runtime 11.3 LTS and above, setting the schema for these tables is optional for formats that support schema evolution: CREATE TABLE IF NOT EXISTS my_table [(col_1 col_1_type, col_2 col_2_type, ...)] [COMMENT ] [TBLPROPERTIES ()]; Note that to infer the schema with COPY INTO, you must pass additional options: COPY INTO my_table FROM '/path/to/files' FILEFORMAT = FORMAT_OPTIONS ('inferSchema' = 'true') COPY_OPTIONS ('mergeSchema' = 'true'); The following example creates a schemaless Delta table called my_pipe_data and loads a pipe-delimited CSV with a header: CREATE TABLE IF NOT EXISTS my_pipe_data; COPY INTO my_pipe_data FROM 's3://my-bucket/pipeData' FILEFORMAT = CSV FORMAT_OPTIONS ('mergeSchema' = 'true', 'delimiter' = '|', 'header' = 'true') COPY_OPTIONS ('mergeSchema' = 'true'); Load JSON data with COPY INTO Load JSON data with COPY INTO The following example loads JSON data from five files in Amazon S3 (S3) into the Delta table called my_json_data. This table must be created before COPY INTO can be executed. If any data was already loaded from one of the files, the data isn’t reloaded for that file. 
COPY INTO my_json_data FROM 's3://my-bucket/jsonData' FILEFORMAT = JSON FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json') -- The second execution will not copy any data since the first command already loaded the data COPY INTO my_json_data FROM 's3://my-bucket/jsonData' FILEFORMAT = JSON FILES = ('f1.json', 'f2.json', 'f3.json', 'f4.json', 'f5.json') Load Avro data with COPY INTO Load Avro data with COPY INTO The following example loads Avro data in S3 using additional SQL expressions as part of the SELECT statement. COPY INTO my_delta_table FROM (SELECT to_date(dt) dt, event as measurement, quantity::double FROM 's3://my-bucket/avroData') FILEFORMAT = AVRO Load CSV files with COPY INTO Load CSV files with COPY INTO The following example loads CSV files from S3 under s3://bucket/base/path/folder1 into a Delta table at s3://bucket/deltaTables/target. COPY INTO delta.`s3://bucket/deltaTables/target` FROM (SELECT key, index, textData, 'constant_value' FROM 's3://bucket/base/path') FILEFORMAT = CSV PATTERN = 'folder1/file_[a-g].csv' FORMAT_OPTIONS('header' = 'true') -- The example below loads CSV files without headers in S3 using COPY INTO. -- By casting the data and renaming the columns, you can put the data in the schema you want COPY INTO delta.`s3://bucket/deltaTables/target` FROM (SELECT _c0::bigint key, _c1::int index, _c2 textData FROM 's3://bucket/base/path') FILEFORMAT = CSV PATTERN = 'folder1/file_[a-g].csv' Ignore corrupt files while loading data Ignore corrupt files while loading data If the data you’re loading can’t be read due to some corruption issue, those files can be skipped by setting ignoreCorruptFiles to true in the FORMAT_OPTIONS. The result of the COPY INTO command returns how many files were skipped due to corruption in the num_skipped_corrupt_files column. This metric also shows up in the operationMetrics column under numSkippedCorruptFiles after running DESCRIBE HISTORY on the Delta table. 
Corrupt files aren’t tracked by COPY INTO, so they can be reloaded in a subsequent run if the corruption is fixed. You can see which files are corrupt by running COPY INTO in VALIDATE mode. COPY INTO my_table FROM '/path/to/files' FILEFORMAT = [VALIDATE ALL] FORMAT_OPTIONS ('ignoreCorruptFiles' = 'true') Note ignoreCorruptFiles is available in Databricks Runtime 11.3 LTS and above."
25280,"## URL: https://docs.databricks.com/en/ingestion/copy-into/generate-temporary-credentials.html ## Content: Generate temporary credentials for ingestion This article describes how to create an IAM user in your AWS account that has just enough access to read data in an Amazon S3 (S3) bucket. Create an IAM policy Create an IAM policy Open the AWS IAM console in your AWS account, typically at https://console.aws.amazon.com/iam. Click Policies. Click Create Policy. Click the JSON tab. Replace the existing JSON code with the following code. In the code, replace: with the name of your S3 bucket. with the name of the folder within your S3 bucket. { ""Version"": ""2012-10-17"", ""Statement"": [ { ""Sid"": ""ReadOnlyAccessToTrips"", ""Effect"": ""Allow"", ""Action"": [ ""s3:GetObject"", ""s3:ListBucket"" ], ""Resource"": [ ""arn:aws:s3:::"", ""arn:aws:s3::://*"" ] } ] } Click Next: Tags. Click Next: Review. Enter a name for the policy and click Create policy. Create an IAM user Create an IAM user In the sidebar, click Users. Click Add users. Enter a name for the user. Select the Access key - Programmatic access box, and then click Next: Permissions. Click Attach existing policies directly. Select the box next to the policy, and then click Next: Tags. Click Next: Review. Click Create user. Copy the Access key ID and Secret access key values that appear to a secure location, as you need them to get the AWS STS session token. Create a named profile Create a named profile On your local development machine, use the AWS CLI to create a named profile with the AWS credentials that you copied in the previous step. See Named profiles for the AWS CLI on the AWS website. Test your AWS credentials. To do this, use the AWS CLI to run the following command, which displays the contents of the folder that contains your data. In the command, replace: with the name of your S3 bucket. with the name of the folder within your S3 bucket. with the name of your named profile. 
aws s3 ls s3://// --profile To get the session token, run the following command: aws sts get-session-token --profile Replace with the name of your named profile. Copy the AccessKeyId, SecretAccessKey, and SessionToken values that appear to a secure location."
25281,"## URL: https://docs.databricks.com/en/ingestion/copy-into/index.html ## Content: Get started using COPY INTO to load data The COPY INTO SQL command lets you load data from a file location into a Delta table. This is a re-triable and idempotent operation; files in the source location that have already been loaded are skipped. COPY INTO offers the following capabilities: Easily configurable file or directory filters from cloud storage, including S3, ADLS Gen2, ABFS, GCS, and Unity Catalog volumes. Support for multiple source file formats: CSV, JSON, XML, Avro, ORC, Parquet, text, and binary files Exactly-once (idempotent) file processing by default Target table schema inference, mapping, merging, and evolution Note For a more scalable and robust file ingestion experience, Databricks recommends that SQL users leverage streaming tables. See Load data using streaming tables in Databricks SQL. Warning COPY INTO respects the workspace setting for deletion vectors. If enabled, deletion vectors are enabled on the target table when COPY INTO runs on a SQL warehouse or compute running Databricks Runtime 14.0 or above. Once enabled, deletion vectors block queries against a table in Databricks Runtime 11.3 LTS and below. See What are deletion vectors? and Auto-enable deletion vectors. Requirements Requirements An account admin must follow the steps in Configure data access for ingestion to configure access to data in cloud object storage before users can load data using COPY INTO. Example: Load data into a schemaless Delta Lake table Example: Load data into a schemaless Delta Lake table Note This feature is available in Databricks Runtime 11.3 LTS and above. 
You can create empty placeholder Delta tables so that the schema is later inferred during a COPY INTO command by setting mergeSchema to true in COPY_OPTIONS: CREATE TABLE IF NOT EXISTS my_table [COMMENT ] [TBLPROPERTIES ()]; COPY INTO my_table FROM '/path/to/files' FILEFORMAT = FORMAT_OPTIONS ('mergeSchema' = 'true') COPY_OPTIONS ('mergeSchema' = 'true'); The SQL statement above is idempotent and can be scheduled to run to ingest data exactly-once into a Delta table. Note The empty Delta table is not usable outside of COPY INTO. INSERT INTO and MERGE INTO are not supported to write data into schemaless Delta tables. After data is inserted into the table with COPY INTO, the table becomes queryable. See Create target tables for COPY INTO. Example: Set schema and load data into a Delta Lake table"
25282,"## URL: https://docs.databricks.com/en/ingestion/copy-into/index.html ## Content: Example: Set schema and load data into a Delta Lake table The following example shows how to create a Delta table and then use the COPY INTO SQL command to load sample data from Databricks datasets into the table. You can run the example Python, R, Scala, or SQL code from a notebook attached to a Databricks cluster. You can also run the SQL code from a query associated with a SQL warehouse in Databricks SQL. DROP TABLE IF EXISTS default.loan_risks_upload; CREATE TABLE default.loan_risks_upload ( loan_id BIGINT, funded_amnt INT, paid_amnt DOUBLE, addr_state STRING ); COPY INTO default.loan_risks_upload FROM '/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet' FILEFORMAT = PARQUET; SELECT * FROM default.loan_risks_upload; -- Result: -- +---------+-------------+-----------+------------+ -- | loan_id | funded_amnt | paid_amnt | addr_state | -- +=========+=============+===========+============+ -- | 0 | 1000 | 182.22 | CA | -- +---------+-------------+-----------+------------+ -- | 1 | 1000 | 361.19 | WA | -- +---------+-------------+-----------+------------+ -- | 2 | 1000 | 176.26 | TX | -- +---------+-------------+-----------+------------+ -- ... 
table_name = 'default.loan_risks_upload' source_data = '/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet' source_format = 'PARQUET' spark.sql(""DROP TABLE IF EXISTS "" + table_name) spark.sql(""CREATE TABLE "" + table_name + "" ("" \ ""loan_id BIGINT, "" + \ ""funded_amnt INT, "" + \ ""paid_amnt DOUBLE, "" + \ ""addr_state STRING)"" ) spark.sql(""COPY INTO "" + table_name + \ "" FROM '"" + source_data + ""'"" + \ "" FILEFORMAT = "" + source_format ) loan_risks_upload_data = spark.sql(""SELECT * FROM "" + table_name) display(loan_risks_upload_data) ''' Result: +---------+-------------+-----------+------------+ | loan_id | funded_amnt | paid_amnt | addr_state | +=========+=============+===========+============+ | 0 | 1000 | 182.22 | CA | +---------+-------------+-----------+------------+ | 1 | 1000 | 361.19 | WA | +---------+-------------+-----------+------------+ | 2 | 1000 | 176.26 | TX | +---------+-------------+-----------+------------+ ... '''"
25283,"## URL: https://docs.databricks.com/en/ingestion/copy-into/index.html ## Content: library(SparkR) sparkR.session() table_name = ""default.loan_risks_upload"" source_data = ""/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"" source_format = ""PARQUET"" sql(paste(""DROP TABLE IF EXISTS "", table_name, sep = """")) sql(paste(""CREATE TABLE "", table_name, "" ("", ""loan_id BIGINT, "", ""funded_amnt INT, "", ""paid_amnt DOUBLE, "", ""addr_state STRING)"", sep = """" )) sql(paste(""COPY INTO "", table_name, "" FROM '"", source_data, ""'"", "" FILEFORMAT = "", source_format, sep = """" )) loan_risks_upload_data = tableToDF(table_name) display(loan_risks_upload_data) # Result: # +---------+-------------+-----------+------------+ # | loan_id | funded_amnt | paid_amnt | addr_state | # +=========+=============+===========+============+ # | 0 | 1000 | 182.22 | CA | # +---------+-------------+-----------+------------+ # | 1 | 1000 | 361.19 | WA | # +---------+-------------+-----------+------------+ # | 2 | 1000 | 176.26 | TX | # +---------+-------------+-----------+------------+ # ... 
val table_name = ""default.loan_risks_upload"" val source_data = ""/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"" val source_format = ""PARQUET"" spark.sql(""DROP TABLE IF EXISTS "" + table_name) spark.sql(""CREATE TABLE "" + table_name + "" ("" + ""loan_id BIGINT, "" + ""funded_amnt INT, "" + ""paid_amnt DOUBLE, "" + ""addr_state STRING)"" ) spark.sql(""COPY INTO "" + table_name + "" FROM '"" + source_data + ""'"" + "" FILEFORMAT = "" + source_format ) val loan_risks_upload_data = spark.table(table_name) display(loan_risks_upload_data) /* Result: +---------+-------------+-----------+------------+ | loan_id | funded_amnt | paid_amnt | addr_state | +=========+=============+===========+============+ | 0 | 1000 | 182.22 | CA | +---------+-------------+-----------+------------+ | 1 | 1000 | 361.19 | WA | +---------+-------------+-----------+------------+ | 2 | 1000 | 176.26 | TX | +---------+-------------+-----------+------------+ ... */ To clean up, run the following code, which deletes the table: spark.sql(""DROP TABLE "" + table_name) sql(paste(""DROP TABLE "", table_name, sep = """")) spark.sql(""DROP TABLE "" + table_name) DROP TABLE default.loan_risks_upload"
25284,"## URL: https://docs.databricks.com/en/ingestion/copy-into/index.html ## Content: Reference Reference Databricks Runtime 7.x and above: COPY INTO Additional resources Additional resources Load data using COPY INTO with Unity Catalog volumes or external locations Load data using COPY INTO with an instance profile For common use patterns, including examples of multiple COPY INTO operations against the same Delta table, see Common data loading patterns using COPY INTO."
25285,"## URL: https://docs.databricks.com/en/ingestion/copy-into/temporary-credentials.html ## Content: Load data using COPY INTO with temporary credentials If your Databricks cluster or SQL warehouse doesn’t have permissions to read your source files, you can use temporary credentials to access data from external cloud object storage and load files into a Delta Lake table. Depending on how your organization manages your cloud security, you might need to ask a cloud administrator or power user to provide you with credentials. For more information, see Generate temporary credentials for ingestion. Specifying temporary credentials or encryption options to access data Specifying temporary credentials or encryption options to access data Note Credential and encryption options are available in Databricks Runtime 10.4 LTS and above. COPY INTO supports: Azure SAS tokens to read data from ADLS Gen2 and Azure Blob Storage. Azure Blob Storage temporary tokens are at the container level, whereas ADLS Gen2 tokens can be at the directory level in addition to the container level. Databricks recommends using directory level SAS tokens when possible. The SAS token must have “Read”, “List”, and “Permissions” permissions. AWS STS tokens to read data from AWS S3. Your tokens should have the “s3:GetObject*”, “s3:ListBucket”, and “s3:GetBucketLocation” permissions. Warning To avoid misuse or exposure of temporary credentials, Databricks recommends that you set expiration horizons that are just long enough to complete the task. COPY INTO supports loading encrypted data from AWS S3. To load encrypted data, provide the type of encryption and the key to decrypt the data. Load data using temporary credentials Load data using temporary credentials The following example loads data from S3 and ADLS Gen2 using temporary credentials to provide access to the source data. 
COPY INTO my_json_data FROM 's3://my-bucket/jsonData' WITH ( CREDENTIAL (AWS_ACCESS_KEY = '...', AWS_SECRET_KEY = '...', AWS_SESSION_TOKEN = '...') ) FILEFORMAT = JSON COPY INTO my_json_data FROM 'abfss://container@storageAccount.dfs.core.windows.net/jsonData' WITH ( CREDENTIAL (AZURE_SAS_TOKEN = '...') ) FILEFORMAT = JSON Load encrypted data Load encrypted data Using customer-provided encryption keys, the following example loads data from S3. COPY INTO my_json_data FROM 's3://my-bucket/jsonData' WITH ( ENCRYPTION (TYPE = 'AWS_SSE_C', MASTER_KEY = '...') ) FILEFORMAT = JSON Load JSON data using credentials for source and target Load JSON data using credentials for source and target The following example loads JSON data from a file on AWS S3 into the external Delta table called my_json_data. This table must be created before COPY INTO can be executed. The command uses one existing credential to write to external Delta table and another to read from the S3 location. COPY INTO my_json_data WITH (CREDENTIAL target_credential) FROM 's3://my-bucket/jsonData' WITH (CREDENTIAL source_credential) FILEFORMAT = JSON FILES = ('f.json')"
25286,"## URL: https://docs.databricks.com/en/ingestion/copy-into/tutorial-dbsql.html ## Content: Load data using COPY INTO with an instance profile This article describes how to use the COPY INTO command to load data from an Amazon S3 bucket in your AWS account into a table in Databricks SQL. The steps in this article assume that your admin has configured a SQL warehouse to use an AWS instance profile so that you can access your source files in S3. If your admin configured a Unity Catalog external location with a storage credential, see Load data using COPY INTO with Unity Catalog volumes or external locations instead. If your admin gave you temporary credentials (an AWS access key ID, a secret key, and a session token), see Load data using COPY INTO with temporary credentials instead. Databricks recommends using the COPY INTO command for incremental and bulk data loading with Databricks SQL. Note COPY INTO works well for data sources that contain thousands of files. Databricks recommends that you use Auto Loader for loading millions of files, which is not supported in Databricks SQL. Before you begin Before you begin Before you load data into Databricks, make sure you have the following: Access to data in S3. Your admin must first complete the steps in Configure data access for ingestion so your Databricks SQL warehouse can read your source files. A Databricks SQL warehouse that uses the instance profile that your admin created. The Can manage permission on the SQL warehouse. The fully qualified S3 URI. Familiarity with the Databricks SQL user interface. Step 1: Confirm access to data in cloud storage Step 1: Confirm access to data in cloud storage To confirm that you have access to the correct data in cloud object storage, do the following: In the sidebar, click Create > Query. In the SQL editor’s menu bar, select a SQL warehouse. In the SQL editor, paste the following code: select * from csv. Replace with the S3 URI that you received from your admin. 
For example, s3:////. Click Run. Step 2: Create a table Step 2: Create a table This step describes how to create a table in your Databricks workspace to hold the incoming data. In the SQL editor, paste the following code: CREATE TABLE .. ( tpep_pickup_datetime TIMESTAMP, tpep_dropoff_datetime TIMESTAMP, trip_distance DOUBLE, fare_amount DOUBLE, pickup_zip INT, dropoff_zip INT ); Click Run. Step 3: Load data from cloud storage into the table Step 3: Load data from cloud storage into the table This step describes how to load data from an S3 bucket into the table you created in your Databricks workspace. In the sidebar, click Create > Query. In the SQL editor’s menu bar, select a SQL warehouse and make sure the SQL warehouse is running. In the SQL editor, paste the following code. In this code, replace: with the name of your S3 bucket. with the name of the folder in your S3 bucket. COPY INTO .. FROM 's3:////' FILEFORMAT = CSV FORMAT_OPTIONS ( 'header' = 'true', 'inferSchema' = 'true' ) COPY_OPTIONS ( 'mergeSchema' = 'true' ); SELECT * FROM ..; Note FORMAT_OPTIONS differs depending on FILEFORMAT. In this case, the header option instructs Databricks to treat the first row of the CSV file as a header, and the inferSchema option instructs Databricks to automatically determine the data type of each field in the CSV file. Click Run. Note If you click Run again, no new data is loaded into the table. This is because the COPY INTO command only processes what it considers to be new data. Clean up Clean up You can clean up the associated resources in your workspace if you no longer want to keep them. Delete the tables In the sidebar, click Create > Query. Select a SQL warehouse and make sure that the SQL warehouse is running. Paste the following code: DROP TABLE ..; Click Run. Hover over the tab for this query, and then click the X icon. Delete the queries in the SQL editor In the sidebar, click SQL Editor. 
In the SQL editor’s menu bar, hover over the tab for each query that you created for this tutorial, and then click the X icon. Additional resources Additional resources The COPY INTO reference article"

0,1
25773,"## URL: https://docs.databricks.com/en/notebooks/ipywidgets.html ## Content: Which third-party Jupyter widgets are supported in Databricks? Databricks provides best-effort support for third-party widgets, such as ipyleaflet, bqplot, and VegaFusion. However, some third-party widgets are not supported. For a list of the widgets that have been tested in Databricks notebooks, contact your Databricks account team. Limitations Limitations A notebook using ipywidgets must be attached to a running cluster. Widget states are not preserved across notebook sessions. You must re-run widget cells to render them each time you attach the notebook to a cluster. The Password and Controller ipywidgets are not supported. HTMLMath and Label widgets with LaTeX expressions do not render correctly. (For example, widgets.Label(value=r'$$\frac{x+1}{x-1}$$') does not render correctly.) Widgets might not render properly if the notebook is in dark mode, especially colored widgets. Widget outputs cannot be used in notebook dashboard views. The maximum message payload size for an ipywidget is 5 MB. Widgets that use images or large text data may not be properly rendered."
25774,"## URL: https://docs.databricks.com/en/notebooks/new-cell-ui-orientation.html ## Content: Databricks notebooks: orientation to the new cell UI The new cell UI is an updated look and feel for Databricks notebooks. This guide is designed to orient users who are familiar with the existing notebook UI. Enable the new UI Enable the new UI The quickest way to preview the new UI is the preview tag available in the notebook header. This tag displays the current status. Click the tag and toggle the switch to ON. Then click Reload page next to the toggle. The page reloads with the new cell UI enabled. If you click Don’t show this again to remove the preview tag, you can click View > Developer settings at any time and toggle New cell UI under Experimental features. Orientation Orientation This section describes some commonly used features and where to find them in the new UI. Run button The run button is at the upper-left of the cell. Click the sideways-pointing arrow to run the cell with a single click. Click the downward-pointing arrow to display a menu. When a cell is running, the run button displays a spinner and shows the current time spent running the command. You can click this button to cancel the execution. After a cell has finished running, the last run time and duration appear to the right of the button. Hover your cursor over this to see more details. Cell numbers and titles Cell numbers and titles appear in the center of the cell toolbar. To add or edit a title, click on the cell number or title. Cells with titles now appear in the table of contents to assist you in navigating around a notebook. Add cells shortcut To add a new cell, hover in the space between cells. You have the option to add a new, empty code cell or a Markdown text cell. Hidden code or results To view hidden code or results, click the show icon at the upper-right of the cell. 
Floating toolbar The toolbar remains visible when you scroll down a large code cell to provide more convenient access to cell status and actions. Focus mode To edit a single cell in full screen mode, use focus mode. Click the focus mode icon in the toolbar. This opens a full screen editor for the cell. Any results appear in the bottom panel. You can navigate to adjacent cells using the arrows on either side of the cell title or using the notebook table of contents. Drag and drop to re-order cells To move a cell up or down, click and hold the drag handle icon at the left of the cell. Move the cell to the desired location and release the mouse. Frequently asked questions Frequently asked questions Can I remove the margin on the sides of the cell? You can toggle this preference using the View > Centered layout setting in the notebook menu. Where can I see detailed run information of a cell? Mouse over the run information next to the run button to see a tooltip with detailed run information. If you have a tabular result output this information is also accessible by hovering over the “Last refreshed” section of the UI. How can I get line numbers back? Use View > Line numbers in the notebook menu to toggle line numbers on or off. Where did the minimize cell icon go? The minimize icon has been removed. To minimize a cell, double-click the drag handle or select Collapse cell in the cell menu. Where did the dashboard icon go? Select Add to dashboard in the cell menu. How can I give additional feedback on the new cell UI? Use the Provide feedback link in the expanded preview tag or, if you have hidden the tag, in the notebook header."
25775,"## URL: https://docs.databricks.com/en/notebooks/notebook-editor.html ## Content: Use the Databricks notebook and file editor This page describes some of the functions available with the Databricks notebook and file editor, including code suggestions and autocomplete, variable inspection, code folding, and side-by-side diffs. When you use the notebook or the file editor, Databricks Assistant is available to help you generate, explain, and debug code. See Use Databricks Assistant for details. You can choose from a selection of editor themes. Select View > Editor theme and make a selection from the menu. Autocomplete Autocomplete Autocomplete automatically completes code segments as you type them. Completable objects include types, classes, and objects, as well as SQL database and table names. For Python cells, the notebook must be attached to a cluster for autocomplete to work, and you must run all cells that define completable objects. For SQL cells, autocomplete suggests keywords and basic syntax even if the notebook is not attached to any compute resource. If the workspace is enabled for Unity Catalog, autocomplete also suggests catalog, schema, table, and column names for tables in Unity Catalog. If the workspace is not enabled for Unity Catalog, the notebook must be attached to a cluster or a SQL warehouse to suggest table or column names. Autocomplete suggestions automatically appear as you type in a cell. Use the up and down arrow keys or your mouse to select a suggestion, and press Tab or Enter to insert the selection into the cell. Note Server autocomplete in R notebooks is blocked during command execution. There are two user settings to be aware of: To turn off autocomplete suggestions, toggle Autocomplete as you type. When autocomplete is off, you can display autocomplete suggestions by pressing Ctrl + Space. To prevent Enter from inserting autocomplete suggestions, toggle Enter key accepts autocomplete suggestions. 
Variable inspection Variable inspection To display information about a variable defined in a SQL or Python notebook, hover your cursor over the variable name. Python variable inspection requires Databricks Runtime 12.2 LTS or above. Go to definition Go to definition When a Python notebook is attached to a cluster, you can quickly go to the definition of a variable, function, or the code behind a %run statement. To do this, right-click the variable or function name, and then click Go to definition or Peek definition. You can also hold down the Cmd key on macOS or Ctrl key on Windows and hover over the variable or function name. The name turns into a hyperlink if a definition is found. The “go to definition” feature is available in Databricks Runtime 12.2 LTS and above. Code folding Code folding Code folding lets you temporarily hide sections of code. This can be helpful when working with long code blocks because it lets you focus on specific sections of code you are working on. To hide code, place your cursor at the far left of a cell. Downward-pointing arrows appear at logical points where you can hide a section of code. Click the arrow to hide a code section. Click the arrow again (now pointing to the right) to show the code. For more details, including keyboard shortcuts, see the VS Code documentation. Multicursor support Multicursor support You can create multiple cursors to make simultaneous edits easier, as shown in the video: To create multiple cursors in a cell: On macOS, hold down the Option key and click in each location to add a cursor. On Windows, hold down the Alt key and click in each location to add a cursor. You also have the option to change the shortcut. See Change shortcut for multicursor and column selection. On macOS, you can create multiple cursors that are vertically aligned by using the keyboard shortcut Option+Command+ up or down arrow key. 
Column (box) selection Column (box) selection To select multiple items in a column, click at the upper left of the area you want to capture. Then: On macOS, press Shift + Option and drag to the lower right to capture one or more columns. On Windows, press Shift + Alt and drag to the lower right to capture one or more columns. You also have the option to change the shortcut. See Change shortcut for multicursor and column selection. Change shortcut for multicursor and column selection Change shortcut for multicursor and column selection An alternate shortcut is available for multicursor and column (box) selection. With the alternate selection, the shortcuts change as follows: To create multiple cursors in a cell: On macOS, hold down the Cmd key and click in each location to add a cursor. On Windows, hold down the Ctrl key and click in each location to add a cursor. To select multiple items in a column, click at the upper left of the area you want to capture. Then: On macOS, press Option and drag to the lower right to capture one or more columns. On Windows, press Alt and drag to the lower right to capture one or more columns. To enable the alternate shortcuts, do the following: Click your username at the upper-right of the workspace, then click Settings in the dropdown list. In the Settings sidebar, select Developer. In the Code editor section, change the Key modifier for multi-cursor click setting to Cmd for macOS or Ctrl for Windows. When you enable alternate shortcuts, the keyboard shortcut for creating multiple cursors that are vertically aligned does not change. Bracket matching Bracket matching When you click near a parenthesis, square bracket, or curly brace, the editor highlights that character and its matching bracket. Side-by-side diff in version history Side-by-side diff in version history When you display previous notebook versions, the editor displays side-by-side diffs with color highlighting."
25776,"## URL: https://docs.databricks.com/en/notebooks/notebook-editor.html ## Content: Syntax error highlighting Syntax error highlighting When a notebook is connected to a cluster, syntax errors are highlighted by a squiggly red line. For Python, the cluster must be running Databricks Runtime 12.2 LTS or above. To enable or disable syntax error highlighting, do the following: Click your username at the upper-right of the workspace, then click Settings in the dropdown list. In the Settings sidebar, select Developer. In the Code editor section, toggle the setting for SQL syntax error highlighting or Python syntax error highlighting. Possible actions on syntax errors and warnings Possible actions on syntax errors and warnings When you see a syntax error, you can hover over it and select Quick Fix for possible actions. Note This feature uses Databricks Assistant. If you don’t see any actions, that means your administrator needs to enable Databricks Assistant first."
25779,"## URL: https://docs.databricks.com/en/notebooks/notebook-export-import.html ## Content: Export and import Databricks notebooks This page describes how to import and export notebooks in Databricks and the notebook formats that Databricks supports. Supported notebook formats Supported notebook formats Databricks can import and export notebooks in the following formats: Source file: A file containing only source code statements with the extension .scala, .py, .sql, or .r. HTML: A Databricks notebook with the extension .html. Databricks .dbc archive. IPython notebook: A Jupyter notebook with the extension .ipynb. RMarkdown: An R Markdown document with the extension .Rmd. Import a notebook Import a notebook You can import an external notebook from a URL or a file. You can also import a ZIP archive of notebooks exported in bulk from a Databricks workspace. Click Workspace in the sidebar. Do one of the following: Right-click on a folder and select Import. To import a notebook at the top level of the current workspace folder, click the kebab menu at the upper right and select Import. Specify the URL or browse to a file containing a supported external format or a ZIP archive of notebooks exported from a Databricks workspace. Click Import. If you choose a single notebook, it is imported into the current folder. If you choose a DBC or ZIP archive, its folder structure is recreated in the current folder and each notebook is imported. Import a file and convert it to a notebook Import a file and convert it to a notebook You can convert Python, SQL, Scala, and R scripts to single-cell notebooks by adding a comment to the first cell of the file: # Databricks notebook source -- Databricks notebook source // Databricks notebook source # Databricks notebook source To define cells in a script, use the special comment shown below. When you import the script to Databricks, cells are created as marked by the COMMAND lines shown. 
# COMMAND ---------- -- COMMAND ---------- // COMMAND ---------- # COMMAND ---------- Export notebooks Export notebooks Note When you export a notebook as HTML, IPython notebook (.ipynb), or archive (DBC), and you have not cleared the command outputs, the outputs are included in the export. To export a notebook, select File > Export in the notebook toolbar and select the export format. To export all folders in a workspace folder as a ZIP archive: Click Workspace in the sidebar. Right-click the folder and select Export. Select the export format: DBC Archive: Export a Databricks archive, a binary format that includes metadata and notebook command outputs. Source File: Export a ZIP archive of notebook source files, which can be imported into a Databricks workspace, used in a CI/CD pipeline, or viewed as source files in each notebook’s default language. Notebook command outputs are not included. HTML Archive: Export a ZIP archive of HTML files. Each notebook’s HTML file can be imported into a Databricks workspace or viewed as HTML. Notebook command outputs are included."
25778,"## URL: https://docs.databricks.com/en/notebooks/notebook-isolation.html ## Content: Notebook isolation Notebook isolation refers to the visibility of variables and classes between notebooks. Databricks supports two types of isolation: Variable and class isolation Spark session isolation Note Databricks manages user isolation using access modes configured on clusters. No isolation shared: Multiple users can use the same cluster. Users share credentials set at the cluster level. No data access controls are enforced. Single User: Only the named user can use the cluster. All commands run with that user’s privileges. Table ACLs in the Hive metastore are not enforced. This access mode supports Unity Catalog. Shared: Multiple users can use the same cluster. Users are fully isolated from one another, and each user runs commands with their own privileges. Table ACLs in the Hive metastore are enforced. This access mode supports Unity Catalog. Variable and class isolation Variable and class isolation Variables and classes are available only in the current notebook. For example, two notebooks attached to the same cluster can define variables and classes with the same name, but these objects are distinct. To define a class that is visible to all notebooks attached to the same cluster, define the class in a package cell. Then you can access the class by using its fully qualified name, which is the same as accessing a class in an attached Scala or Java library. Spark session isolation Spark session isolation Every notebook attached to a cluster has a pre-defined variable named spark that represents a SparkSession. SparkSession is the entry point for using Spark APIs as well as setting runtime configurations. Spark session isolation is enabled by default. You can also use global temporary views to share temporary views across notebooks. See CREATE VIEW. To disable Spark session isolation, set spark.databricks.session.share to true in the Spark configuration. 
Important Setting spark.databricks.session.share to true breaks the monitoring used by both streaming notebook cells and streaming jobs. Specifically: The graphs in streaming cells are not displayed. Jobs do not block as long as a stream is running (they just finish “successfully”, stopping the stream). Streams in jobs are not monitored for termination. Instead you must manually call awaitTermination(). Creating a new visualization on streaming DataFrames doesn’t work. Cells that trigger commands in other languages (that is, cells using %scala, %python, %r, and %sql) and cells that include other notebooks (that is, cells using %run) are part of the current notebook. Thus, these cells are in the same session as other notebook cells. By contrast, a notebook workflow runs a notebook with an isolated SparkSession, which means temporary views defined in such a notebook are not visible in other notebooks."
25779,"## URL: https://docs.databricks.com/en/notebooks/notebook-outputs.html ## Content: Notebook outputs and results After you attach a notebook to a cluster and run one or more cells, your notebook has state and displays outputs. This section describes how to manage notebook state and outputs. Clear notebooks state and outputs Clear notebooks state and outputs To clear the notebook state and outputs, select one of the Clear options at the bottom of the Run menu. Menu option Description Clear all cell outputs Clears the cell outputs. This is useful if you are sharing the notebook and do not want to include any results. Clear state Clears the notebook state, including function and variable definitions, data, and imported libraries. Clear state and outputs Clears both cell outputs and the notebook state. Clear state and run all Clears the notebook state and starts a new run. Show results Show results When a cell is run, table results return a maximum of 10,000 rows or 2 MB, whichever is less. By default, text results return a maximum of 50,000 characters. With Databricks Runtime 12.2 LTS and above, you can increase this limit by setting the Spark configuration property spark.databricks.driver.maxReplOutputLength. Explore SQL cell results in Python notebooks natively using Python You can load data using SQL and explore it using Python. In a Databricks Python notebook, table results from a SQL language cell are automatically made available as a Python DataFrame. For details, see Explore SQL cell results in Python notebooks. New cell result table Preview This feature is in Public Preview. You can now select a new cell result table rendering. With the new result table, you can do the following: Copy a column or other subset of tabular results to the clipboard. Do a text search over the results table. Sort and filter data. Navigate between table cells using the keyboard arrow keys. 
Select part of a column name or cell value by double-clicking and dragging to select the desired text. To enable the new result table, click New result table in the upper-right corner of the cell results, and change the toggle selector from OFF to ON. When the feature is on, you can click column or row headers to select entire columns or rows, and you can click in the upper-left cell of the table to select the entire table. You can drag your cursor across any rectangular set of cells to select them. To copy the selected data to the clipboard, press Cmd + c on macOS or Ctrl + c on Windows, or right-click and select Copy from the drop-down menu. To search for text in the results table, enter the text in the Search box. Matching cells are highlighted. To open a side panel that displays information about the selection, click the panel icon in the upper-right corner, next to the Search box. Column headers indicate the data type of the column. For example, indicates integer data type. Hover over the indicator to see the data type. Sort and filter results When you use the new cell result table rendering, you can sort and filter results. To sort the table by the values in a column, hover your cursor over the column name. At the right of the cell containing the column name, an icon appears. Click the arrow to sort the column. Successive clicks toggle through sorting in ascending order, descending order, or unsorted. To sort by multiple columns, hold down the Shift key as you click the sort arrow for the columns. To create a filter, click at the upper-right of the cell results. In the dialog that appears, select the column to filter on and the filter rule and value to apply. For example: To add another filter, click . To temporarily enable or disable a filter, toggle the Enabled/Disabled button in the dialog. To delete a filter, click the X next to the filter name. 
To filter by a specific value, right-click on a cell with that value and select Filter by this value from the drop-down menu. You can also create a filter from the kebab menu in the column name: Filters are applied only to the results shown in the results table. If the data returned is truncated (for example, when a query returns more than 64,000 rows), the filter is applied only to the returned rows. Download results Download results By default downloading results is enabled. To toggle this setting, see Manage the ability to download results from notebooks. You can download a cell result that contains tabular output to your local machine. Click the downward pointing arrow next to the tab title. The menu options depend on the number of rows in the result and on the Databricks Runtime version. Downloaded results are saved on your local machine as a CSV file named export.csv. View multiple outputs per cell View multiple outputs per cell Python notebooks and %python cells in non-Python notebooks support multiple outputs per cell. For example, the output of the following code includes both the plot and the table: import pandas as pd from sklearn.datasets import load_iris data = load_iris() iris = pd.DataFrame(data=data.data, columns=data.feature_names) ax = iris.plot() print(""plot"") display(ax) print(""data"") display(iris) Commit notebook outputs in Databricks Git folders Commit notebook outputs in Databricks Git folders To learn about committing .ipynb notebook outputs, see Allow committing .ipynb notebook output. The notebook must be an .ipynb file Workspace admin settings must allow notebook outputs to be committed."
25780,"## URL: https://docs.databricks.com/en/notebooks/notebook-ui.html ## Content: Databricks notebook interface and controls The notebook toolbar includes menus and icons that you can use to manage and edit the notebook. Next to the notebook name are buttons that let you change the default language of the notebook and, if the notebook is included in a Databricks Git folder, open the Git dialog. To view previous versions of the notebook, click the “Last edit…” message to the right of the menus. Updated cell design Updated cell design Preview This feature is in Public Preview. An updated cell design is available. This page includes information about how to use both versions of the cell design. For an orientation to the new UI and answers to common questions, see Orientation to the new cell UI. To enable or disable the new cell design, open the editor settings page in the workspace. In the sidebar, click Developer. Under Experimental features, toggle New cell UI. Notebook cells Notebook cells Notebooks contain a collection of two types of cells: code cells and Markdown cells. Code cells contain runnable code. Markdown cells contain Markdown code that renders into text and graphics when the cell is executed and can be used to document or illustrate your code. You can add or remove cells to your notebook to structure your work. You can run a single cell, a group of cells, or run the whole notebook at once. A notebook cell can contain at most 10MB. Notebook cell output is limited to 20MB. Notebook toolbar icons and buttons Notebook toolbar icons and buttons The icons and buttons at the right of the toolbar are described in the following table: Icon Description Run all cells or stop execution. The name of this button changes depending on the state of the notebook. Open compute selector. When the notebook is connected to a cluster or SQL warehouse, this button shows the name of the compute resource. Open job scheduler. Open Delta Live Tables. 
This button appears only if the notebook is part of a Delta Live Tables pipeline. Open permissions dialog. Right sidebar actions Right sidebar actions Several actions are available from the notebook’s right sidebar, as described in the following table: Icon Description Open notebook comments. Open MLflow notebook experiment. Open notebook version history. Open variable explorer. (Available for Python variables with Databricks Runtime 12.2 LTS and above.) Open the Python environment panel. This panel shows all Python libraries available to the notebook, including notebook-scoped libraries, cluster libraries, and libraries included in the Databricks Runtime. Available only when the notebook is attached to a cluster. Browse data Browse data Preview This feature is in Public Preview. To explore tables and volumes available to use in the notebook, click at the left side of the notebook to open the schema browser. See Browse data for more details. Cell actions menu Cell actions menu The cell actions menu lets you cut and copy cells, move cells around in the notebook, and hide code or results. The menu has a different appearance in the original UI and the new UI. This section includes instructions for both versions. If Databricks Assistant is enabled in your workspace, you can use it in a code cell to get help or suggestions for your code. To open a Databricks Assistant text box in a cell, click the Databricks Assistant icon in the upper-right corner of the cell. You can easily change a cell between code and markdown, or change the language of a code cell, using the cell language button near the upper-right corner of the cell. Cell actions menu (original UI) From this menu you can also run code cells: The cell action menu also includes buttons that let you hide a cell or delete a cell . For Markdown cells, there is also an option to add the cell to a dashboard. For more information, see Dashboards in notebooks. 
Work with cells in the new UI The following screenshot describes the icons that appear at the upper-right of a notebook cell: Language selector: Select the language for the cell. Databricks Assistant: Enable or disable Databricks Assistant for code suggestions in the cell. Cell focus: Enlarge the cell to make it easier to edit. Display cell actions menu: Open the cell actions menu. The options in this menu are slightly different for code and Markdown cells. To run code cells in the new UI, click the down arrow at the upper-left of the code cell. After a cell has been run, a notice appears to the right of the cell run menu, showing the last time the cell was run and the duration of the run. Hover your cursor over the notice for more details. To add a Markdown cell or a cell that has tabular results to a dashboard, select Add to dashboard from the cell actions menu. For more information, see Dashboards in notebooks. To delete a cell, click the trash icon to the right of the cell. This icon only appears when you hover your cursor over the cell. To add a comment to code in a cell, highlight the code. To the right of the cell, a comment icon appears. Click the icon to open the comment box. To move a cell up or down, click and hold outside the upper-left corner of the cell, and drag the cell to the new location. You can also select Move up or Move down from the cell actions menu. Create cells"
25781,"## URL: https://docs.databricks.com/en/notebooks/notebook-ui.html ## Content: Create cells Notebooks have two types of cells: code and Markdown. The contents of Markdown cells are rendered into HTML. For example, this snippet contains markup for a level-three heading: %md ### Libraries Import the necessary libraries. renders as shown: Create a cell (original UI) To create a new cell in the original UI, hover over a cell at the top or bottom and click the icon. You can also use the notebook cell menu: click and select Add Cell Above or Add Cell Below. For a code cell, just type code into the cell. To create a Markdown cell, select Markdown from the cell’s language button or type %md at the top of the cell. Create a cell (new UI) To create a new cell in the new UI, hover over a cell at the top or bottom. Click on Code or Text to create a code or Markdown cell, respectively. Cut, copy, and paste cells Cut, copy, and paste cells There are several options to cut and copy cells. If you are using the Safari browser, only the keyboard shortcuts are available. From the cell actions menu in the original UI or the new UI, select Cut cell or Copy cell. Use keyboard shortcuts: Command-X or Ctrl-X to cut and Command-C or Ctrl-C to copy. Use the Edit menu at the top of the notebook. Select Cut or Copy. After you cut or copy cells, you can paste those cells elsewhere in the notebook, into a different notebook, or into a notebook in a different browser tab or window. To paste cells, use the keyboard shortcut Command-V or Ctrl-V. The cells are pasted below the current cell. To undo cut or paste actions, you can use the keyboard shortcut Command-Z or Ctrl-Z or the menu options Edit > Undo cut cells or Edit > Undo paste cells. To select adjacent cells, click in a Markdown cell and then use Shift + Up or Down to select the cells above or below it. Use the edit menu to copy, cut, paste, or delete the selected cells as a group. 
To select all cells, select Edit > Select all cells or use the command mode shortcut Cmd+A. Notebook table of contents Notebook table of contents To display an automatically generated table of contents, click the icon at the upper left of the notebook (between the left sidebar and the topmost cell). The table of contents is generated from the Markdown headings used in the notebook. If you are using the new UI, cells with titles also appear in the table of contents. Cell display options Cell display options There are three display options for notebooks. Use the View menu to change the display option. Standard view: results are displayed immediately after code cells. Results only: only results are displayed. Side-by-side: code and results cells are displayed side by side. In the new UI, actions are available from icons in the cell gutter (the area to the right and left of the cell). For example, to move a cell up or down, use the grip dots in the left gutter. To delete a cell, use the trash can icon in the right gutter. For easier editing, click the focus mode icon to display the cell at full width. To exit focus mode, click . You can also enlarge the displayed width of a cell by turning off View > Centered layout. To automatically format all cells in the notebook to industry standard line lengths and spacing, select Edit > Format notebook. Line and command numbers Line and command numbers To show or hide line numbers or command numbers, select Line numbers or Command numbers from the View menu. For line numbers, you can also use the keyboard shortcut Control+L. If you enable line or command numbers, Databricks saves your preference and shows them in all of your other notebooks for that browser. Line and command numbers (original UI) Command numbers above cells link to that specific command. If you click the command number for a cell, it updates your URL to be anchored to that command. 
To get a URL link to a specific command in your notebook, right-click the command number and choose Copy Link Address. Line and command numbers (new UI) Line numbers are off by default in the new UI. To turn them on, select View > Line numbers. When a cell is in an error state, line numbers are displayed regardless of the selection. To toggle command numbers, select View > Command numbers. The new UI does not include cell command number links. To get a URL link to a specific command in your notebook, click to open focus mode, and copy the URL from the browser address bar. To exit focus mode, click . Add a cell title Add a cell title To add a title to a cell using the original UI, select Show Title from the cell actions menu. To add a title to a cell using the new UI, do one of the following: Click the cell number shown at the center of the top of the cell and type the title. Select Add title from the cell actions menu. With the new UI, cells that have titles appear in the notebook’s table of contents. View notebooks in dark mode View notebooks in dark mode You can choose to display notebooks in dark mode. To turn dark mode on or off, select View > Theme and select Light theme or Dark theme. Hide and show cell content"
25782,"## URL: https://docs.databricks.com/en/notebooks/notebook-ui.html ## Content: Hide and show cell content Cell content consists of cell code and the results generated by running the cell. You can hide and show the cell code and result using the cell actions menu at the upper-right of the cell. For related functionality, see Collapsible headings. Hide and show cell content (original UI) To hide cell code or results, click and select Hide Code or Hide Result. You can also select to display only the first line of a cell. To show hidden cell code or results, click the Show links: Hide and show cell content (new UI) To hide cell code or results, click the kebab menu at the upper-right of the cell and select Hide code or Hide result. You can also select Collapse cell to display only the first line of a cell. To expand a collapsed cell, select Expand cell. To show hidden cell code or results, click the show icon: . Collapsible headings Collapsible headings Cells that appear after cells containing Markdown headings can be collapsed into the heading cell. To expand or collapse cells after cells containing Markdown headings throughout the notebook, select Collapse all headings from the View menu. The rest of this section describes how to expand or collapse a subset of cells. For related functionality, see Hide and show cell content. Expand and collapse headings (original UI) The image shows a level-two heading MLflow setup with the following two cells collapsed into it. To expand and collapse headings, click the + and -. Expand and collapse headings (new UI) The image shows a level-two heading MLflow setup with the following two cells collapsed into it. To expand and collapse headings, hover your cursor over the Markdown cell. Click the arrow that appears to the left of the cell. Compute resources for notebooks Compute resources for notebooks This section covers the options for notebook compute resources. 
You can run a notebook on a Databricks cluster, or, for SQL commands, you also have the option to use a SQL warehouse, a type of compute that is optimized for SQL analytics. Attach a notebook to a cluster To attach a notebook to a cluster, you need the CAN ATTACH TO cluster-level permission. Important As long as a notebook is attached to a cluster, any user with the CAN RUN permission on the notebook has implicit permission to access the cluster. To attach a notebook to a cluster, click the compute selector in the notebook toolbar and select a cluster from the dropdown menu. The menu shows a selection of clusters that you have used recently or that are currently running. To select from all available clusters, click More…. Click on the cluster name to display a dropdown menu, and select an existing cluster. You can also create a new cluster by selecting Create new resource… from the dropdown menu. Important An attached notebook has the following Apache Spark variables defined. Class Variable Name SparkContext sc SQLContext/HiveContext sqlContext SparkSession (Spark 2.x) spark Do not create a SparkSession, SparkContext, or SQLContext. Doing so will lead to inconsistent behavior. Use a notebook with a SQL warehouse When a notebook is attached to a SQL warehouse, you can run SQL and Markdown cells. If you run a cell in any other language (such as Python or R), it throws an error. SQL cells executed on a SQL warehouse appear in the SQL warehouse’s query history. The user who ran a query can view the query profile from the notebook by clicking the elapsed time at the bottom of the output. Running a notebook requires a Pro or Serverless SQL warehouse. You must have access to the workspace and the SQL warehouse. To attach a notebook to a SQL warehouse do the following: Click the compute selector in the notebook toolbar. The dropdown menu shows compute resources that are currently running or that you have used recently. SQL warehouses are marked with . 
From the menu, select a SQL warehouse. To see all available SQL warehouses, select More… from the dropdown menu. A dialog appears showing compute resources available for the notebook. Select SQL Warehouse, choose the warehouse you want to use, and click Attach. You can also select a SQL warehouse as the compute resource for a SQL notebook when you create a workflow or scheduled job. Limitations of SQL warehouses include: When attached to a SQL warehouse, execution contexts have an idle timeout of 8 hours. The maximum size for returned results is 10,000 rows or 2MB, whichever is smaller. Detach a notebook To detach a notebook from a compute resource, click the compute selector in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. From the side menu, select Detach. You can also detach notebooks from a cluster using the Notebooks tab on the cluster details page. When you detach a notebook, the execution context is removed and all computed variable values are cleared from the notebook. Tip Databricks recommends that you detach unused notebooks from clusters. This frees up memory space on the driver. Use web terminal and Databricks CLI"


## Create and Test Vector Index

In this step, we will compute embeddings for a dataset containing information about products and store them in a Vector Search index using Databricks Vector Search.

**🚨IMPORTANT: Vector Search endpoints must be created before running the rest of the demo. These are already created for you in the Databricks Lab environment.**

### Creating a Vector Index via UI


**Steps to Create a Vector Index:**

- Navigate to **Catalog** from the left panel and select your course catalog. 

- Choose your schema. The schema name is printed at the top of the notebook. 

- Select the **`product_text`** table created in the previous step. You can find all resource names at the top of this notebook.

- In the top right corner, click **"Create"** and then **"Vector search index"**.

- Enter **`product_embeddings`** as the index name.

- Choose **`id`** as the primary key.

- Choose the `document` field as the column to sync.

- For embedding source, select **"compute embeddings"**:
  - Choose **`document` column** as the source column.
  - Select **`databricks-gte-large-en`** as the embedding model. Embedding creation will be **managed** by Databricks, which means we don't need to compute embeddings manually.
  - Select the endpoint assigned to you. Refer to the top of this notebook for details on the VS endpoint.  

- Set sync mode to **"Triggered"**.

- Finally, click the **"Create"** button.


### (Alternative) Creating a Vector Index via API 

For simplicity, we created the vector index using the UI. However, this process can also be accomplished programmatically using the `databricks-sdk`. For more detailed instructions, please refer to the [documentation page](https://docs.databricks.com/en/generative-ai/create-query-vector-search.html#create-index-using-the-python-sdk).
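
As a rough sketch of what the programmatic route can look like, the snippet below mirrors the UI settings above using the `create_delta_sync_index` method of `VectorSearchClient` from the `databricks-vectorsearch` package (the client already used in this notebook, not the `databricks-sdk`). The helper names `build_index_spec` and `create_index` are hypothetical, and the parameters supported may differ by package version, so treat this as an assumption to verify against the linked documentation.

```python
def build_index_spec(endpoint_name, source_table_name, index_name):
    """Collect the same settings chosen in the UI steps above."""
    return {
        "endpoint_name": endpoint_name,
        "source_table_name": source_table_name,
        "index_name": index_name,
        "primary_key": "id",
        "pipeline_type": "TRIGGERED",
        # Managed embeddings: Databricks computes them from this text column
        "embedding_source_column": "document",
        "embedding_model_endpoint_name": "databricks-gte-large-en",
    }


def create_index(client, spec):
    """Create a Delta Sync index with managed embeddings.

    `client` is expected to be a
    databricks.vector_search.client.VectorSearchClient instance.
    """
    return client.create_delta_sync_index(**spec)


# Usage in a Databricks notebook (illustrative only):
# from databricks.vector_search.client import VectorSearchClient
# spec = build_index_spec(
#     vs_endpoint_name,
#     f"{DA.catalog_name}.{DA.schema_name}.product_text",
#     vs_index_table_name,
# )
# index = create_index(VectorSearchClient(disable_notice=True), spec)
```

Keeping the settings in a plain dictionary makes it easy to review them before any index is created.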

### Testing the Index: Search for Products Similar to the Query

Before building the RAG pipeline, let's check that the index is ready. We will use the `similarity_search` method to search for documents that are similar to the query text.

In [0]:
from pprint import pprint

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient(disable_notice=True)

question = "How to use Auto Loader in Databricks?"

try:
    # get index created in the previous step
    index = vsc.get_index(vs_endpoint_name, vs_index_table_name)

    # search for similar documents
    results = index.similarity_search(
        query_text=question,
        columns=["document"],
        num_results=4
    )

    # show the results
    docs = results.get("result", {}).get("data_array", [])
    pprint(docs)

except Exception as e:
    print(f"Error occurred while loading the index. Did you create an index in the previous step?: {e}")

[['## URL: https://docs.databricks.com/en/ingestion/auto-loader/production.html\n'
  '\n'
  '## Content: Configure Auto Loader for production workloads  \n'
  'Databricks recommends that you follow the streaming best practices for running Auto Loader in '
  'production.  \n'
  'Databricks recommends using Auto Loader in Delta Live Tables for incremental data ingestion. '
  'Delta Live Tables extends functionality in Apache Spark Structured Streaming and allows you to '
  'write just a few lines of declarative Python or SQL to deploy a production-quality data '
  'pipeline with:  \n'
  'Autoscaling compute infrastructure for cost savings  \n'
  'Data quality checks with expectations  \n'
  'Automatic schema evolution handling  \n'
  'Monitoring via metrics in the event log  \n'
  'Monitoring Auto Loader\n'
  'Monitoring Auto Loader\n'
  'Querying files discovered by Auto Loader  \n'
  'Note  \n'
  'The cloud_files_state function is available in Databricks Runtime 11.3 LTS and above.  \n


**💡 Question:** Four similar documents are returned. What should we do if we want to **use only two of these documents** or if we want to **add documents into the context based on their importance or higher similarity**?
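
One possible way to approach this question is to post-process the search results. The sketch below assumes that each row in `data_array` ends with a similarity score appended after the requested columns; confirm this against your own output before relying on it. The helper `select_documents`, the sample rows, and the threshold value are hypothetical.

```python
def select_documents(rows, max_docs=2, min_score=None):
    """Keep the highest-scoring rows, optionally dropping low-scoring ones.

    Assumes each row looks like [document_text, score], with the score last.
    """
    ranked = sorted(rows, key=lambda row: row[-1], reverse=True)
    if min_score is not None:
        ranked = [row for row in ranked if row[-1] >= min_score]
    return [row[0] for row in ranked[:max_docs]]


# Hypothetical rows in the shape of results["result"]["data_array"]
sample_rows = [
    ["doc about Auto Loader options", 0.71],
    ["doc about Auto Loader in production", 0.83],
    ["doc about notebooks", 0.42],
    ["doc about schema evolution", 0.64],
]

# Keep only the two most similar documents
top_two = select_documents(sample_rows, max_docs=2)

# Keep every document whose score clears a similarity threshold
confident = select_documents(sample_rows, max_docs=4, min_score=0.6)

print(top_two)
print(confident)
```

If only the number of documents matters, requesting fewer results up front with `num_results=2` achieves the same effect without post-processing.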

## Enable MLflow Tracing

MLflow supports auto-logging for LangChain models. Before we begin constructing the chains, we will enable auto-logging as shown below.

In [0]:
import mlflow
mlflow.langchain.autolog()

## Build a RAG Model

With our contextual information prepared and indexed in Vector Search, we can proceed to build a RAG chain.

With MLflow tracing enabled, you will be able to inspect the LangChain pipeline.

**💡 Question:** Which documents are retrieved from Vector Search and used as "context"?

In [0]:
from operator import itemgetter

from langchain_databricks import ChatDatabricks, DatabricksVectorSearch
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_core.output_parsers import StrOutputParser

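# define a retriever factory used in the chain; when piped into the chain, it is
# called with the query string (ignored here) and the retriever it returns is then invoked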
def vector_search_as_retriever(persist_dir=None):
    vectorstore = DatabricksVectorSearch(vs_index_table_name)
    return vectorstore.as_retriever(search_kwargs={"k": 3})

# return the content of the most recent message in messages: [{...}], used as the input question
def extract_user_query_string(chat_messages_array):
    return chat_messages_array[-1]["content"]

def format_context(docs):
    chunk_contents = [f"Passage: {d.page_content}\n" for d in docs]
    return "".join(chunk_contents)

# define template for prompt
prompt_template = PromptTemplate.from_template(
    """
    You are a documentation assistant. Use the context below to answer the question clearly and accurately.

    <context>
    {context}
    </context>

    Question: {question}

    Answer:
    """
)

# define foundation model for generating responses
model = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct", max_tokens=500, temperature=0.8)

# RAG chain
chain = (
    {
        "question": itemgetter("messages") | RunnableLambda(extract_user_query_string),
        "context": itemgetter("messages")
        | RunnableLambda(extract_user_query_string)
        | vector_search_as_retriever
        | RunnableLambda(format_context),
    }
    | prompt_template
    | model
    | StrOutputParser()
)

# let's give it a try:
input_example = {"messages": [ {"role": "user", "content": "How do I use Delta Lake time travel?"}]}
answer = chain.invoke(input_example)
print(answer)



[NOTICE] Using a notebook authentication token. Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True to VectorSearchClient().
To use Delta Lake time travel, you can query a Delta table by adding a clause after the table name specification. The syntax allows you to specify a timestamp or a version. Here are the ways to use Delta Lake time travel:

1. **Using `TIMESTAMP AS OF` clause**: You can specify a timestamp using the `TIMESTAMP AS OF` clause. For example: `SELECT * FROM people10m TIMESTAMP AS OF '2018-10-18T22:15:12.013Z'`
2. **Using `VERSION AS OF` clause**: You can specify a version using the `VERSION AS OF` clause. For example: `SELECT * FROM delta.`/tmp/delta/people10m` VERSION AS OF 123`
3. **Using `@` syntax**: You can use the `@` syntax to specify the timestamp or version as part of the table name. For example: `SELECT * FROM people10m@20190101000000000` or `SELECT * 

Trace(request_id=tr-2cf053dcc50648c6b35c0fe5d8bb5cf8)
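
To make the data flow of the chain easier to follow, here is a dependency-free sketch of the same steps in plain Python. It is not the LangChain implementation: `fake_retriever` is a hypothetical stand-in for the Vector Search retriever, `build_prompt` is a hypothetical helper, and the sample passages are made up. It shows how the last user message becomes both the question and the retrieval query, and how the retrieved passages fill the `context` slot of the prompt.

```python
def extract_user_query_string(chat_messages_array):
    """Return the content of the most recent chat message."""
    return chat_messages_array[-1]["content"]


def fake_retriever(query, k=3):
    """Hypothetical stand-in for the Vector Search retriever (ignores the query)."""
    corpus = [
        "Delta Lake time travel lets you query earlier table versions.",
        "Auto Loader incrementally ingests new files from cloud storage.",
        "Use VERSION AS OF or TIMESTAMP AS OF to read an older snapshot.",
    ]
    return corpus[:k]


def format_context(passages):
    """Join passages using the same 'Passage: ...' prefix as the chain."""
    return "".join(f"Passage: {p}\n" for p in passages)


def build_prompt(payload):
    """Mimic the chain: extract the question, retrieve, then fill the template."""
    question = extract_user_query_string(payload["messages"])
    context = format_context(fake_retriever(question))
    return f"<context>\n{context}</context>\n\nQuestion: {question}\n\nAnswer:"


payload = {"messages": [{"role": "user", "content": "How do I use Delta Lake time travel?"}]}
prompt = build_prompt(payload)
print(prompt)
```

In the real chain, the MLflow trace shows the equivalent intermediate values, including the documents returned by the retriever step.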

## Save the Model to Model Registry in Unity Catalog

Now that our chain is ready and we have tested it with a sample question, we can register it within our Unity Catalog schema. 

After registering the chain, you can view the chain and models in the **Catalog Explorer**.

In [0]:
import mlflow
import langchain
import langchain_community
import databricks.vector_search
from mlflow.models import infer_signature
from mlflow.models.resources import (
    DatabricksVectorSearchIndex
)

# Set Model Registry URI to Unity Catalog
mlflow.set_registry_uri("databricks-uc")

model_name = f"{DA.catalog_name}.{DA.schema_name}.getstarted_genai_rag_demo"
input_example = {"messages": [ {"role": "user", "content": "How to use Auto Loader in Databricks?"}]}

# Register the assembled RAG model in Model Registry with Unity Catalog
with mlflow.start_run(run_name="genai_gs_demo_02_01") as run:
    signature = infer_signature(input_example, answer)
    model_info = mlflow.langchain.log_model(
        lc_model=chain,
        artifact_path="chain",
        input_example=input_example,
        signature=signature,
        pip_requirements=[
            "langchain==" + langchain.__version__,
            "langchain-community==" + langchain_community.__version__,
            "databricks-vectorsearch==" + databricks.vector_search.__version__,
            "langchain-databricks"
        ],
        resources=[
            DatabricksVectorSearchIndex(index_name=vs_index_table_name)
        ]
    )

mlflow.register_model(model_info.model_uri, model_name)



[NOTICE] Using a notebook authentication token. Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True to VectorSearchClient().


Uploading artifacts:   0%|          | 0/23 [00:00<?, ?it/s]

Successfully registered model 'dbacademy.labuser10813094_1751481371.getstarted_genai_rag_demo'.


## Clean up Classroom

**🚨 Warning:** Please refrain from deleting the catalog and tables created in this demo, as they are required for upcoming demonstrations. To clean up the classroom assets, execute the classroom clean-up script provided in the final demo.

## Summary

In this demo, we first created a "managed" Vector Search index from a Delta table and then searched it for documents similar to an input query. In the second part of the demo, we enabled MLflow autologging to trace model runs, built the RAG pipeline, and registered it in the Unity Catalog Model Registry. In the next demos, we will show how to evaluate the performance of this model and deploy it into production.


&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="blank">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy" target="blank">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use" target="blank">Terms of Use</a> | 
<a href="https://help.databricks.com/" target="blank">Support</a>