# Databricks AutoML Lab

## Getting Started

### By the end of this lab you will have learned how to:

1. Upload data to a Unity Catalog Volume

2. Use Databricks AutoML to quickly generate baseline models and notebooks

3. Assess in the UI which models perform best

4. Customize the generated model for better performance




## 0. Lab Setup
1. Start your ML Cluster before adding data. You can find clusters in the `Compute` Tab on the left.

## 1. Upload a .csv File to create a new table - Daily Sales Data

1. Download the [Daily Sales Data](https://github.com/julie-nguyen-ds/asean-workshops-2024/blob/main/Datasets/DI%20Platform%20Lab/product_description.tsv) file to your local computer <br />
2. Navigate to the `Catalog Explorer` and create a `schema` in the catalog
<br /><img style="float:right" src="https://github.com/julie-nguyen-ds/asean-workshops-2024/blob/main/Resources/Screenshots/3.0.png?raw=true"/><br />
3. Click the three-dot button (vertical ellipsis) on the far right of the file name and select `create table`
4. After the upload completes, open the `Sample Data` tab in the `Catalog Explorer` to confirm that the data loaded correctly and is displayed as expected. <br />
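
If you prefer to load the uploaded file from a notebook instead of the UI, the following is a minimal sketch. It assumes the file was placed in a Unity Catalog Volume, whose files are addressed as `/Volumes/<catalog>/<schema>/<volume>/<file>`. The catalog, schema, and volume names below are hypothetical placeholders, and both helper functions are illustrative names rather than part of any Databricks API. `spark` is the session that Databricks notebooks provide.

```python
def volume_file_path(catalog, schema, volume, filename):
    """Build the path of a file stored in a Unity Catalog Volume."""
    return f"/Volumes/{catalog}/{schema}/{volume}/{filename}"


def read_delimited(spark, path, sep=","):
    """Read a delimited text file with a header row into a Spark DataFrame."""
    # header/inferSchema are convenient for exploration; declare an explicit
    # schema instead when column types matter.
    return spark.read.csv(path, sep=sep, header=True, inferSchema=True)


# Hypothetical catalog, schema, and volume names; replace with your own.
path = volume_file_path("my_catalog", "my_schema", "my_volume", "product_description.tsv")
print(path)
```

In a notebook you would then call `df = read_delimited(spark, path, sep="\t")`. The tab separator is an assumption based on the `.tsv` extension of the linked file; use `sep=","` for a comma-separated file.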





In [0]:
sales = spark.table("jn_catalog.datasets.fact_apj_sales")
items = spark.table("jn_catalog.datasets.fact_apj_sale_items")

In [0]:
sales_items = sales.join(items, on="sale_id")
display(sales_items)

## Data Prep

In [0]:
# Keep only the columns needed downstream; take the customer and store IDs from the items table.
joined_data = sales_items.select(
    "sale_id", "ts", "order_source",
    items["unique_customer_id"], items["store_id"],
    "product_id", "product_cost",
)

In [0]:
filtered_store_ids = ["MEL01", "AKL02", "SYD01"]  # Example store IDs to filter
filtered_data = joined_data.filter(joined_data["store_id"].isin(filtered_store_ids))
display(filtered_data)

In [0]:
filtered_data.count()

In [0]:
from pyspark.sql.functions import sum as spark_sum

summed_data = filtered_data.groupBy("store_id", "product_id", "ts").agg(spark_sum("product_cost").alias("daily_sale"))
display(summed_data)
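
To make explicit what a daily total means, here is a plain-Python sketch of the same aggregation over a few hypothetical rows. It groups by store, product, and calendar day. The Spark cell above groups on `ts` as-is, so if `ts` holds timestamps rather than dates (an assumption, since the schema is not shown here), each distinct timestamp forms its own group. Truncating to a date first, for example with `pyspark.sql.functions.to_date`, would then be needed to get one row per day. The function name and sample rows are illustrative only.

```python
from collections import defaultdict
from datetime import datetime


def daily_sales(rows):
    """Sum product_cost per (store_id, product_id, calendar day)."""
    totals = defaultdict(float)
    for store_id, product_id, ts, product_cost in rows:
        # ts.date() drops the time of day, so all sales on one day share a key.
        totals[(store_id, product_id, ts.date())] += product_cost
    return dict(totals)


# Hypothetical rows: (store_id, product_id, ts, product_cost)
rows = [
    ("MEL01", "P1", datetime(2024, 1, 1, 9, 0), 10.0),
    ("MEL01", "P1", datetime(2024, 1, 1, 17, 30), 5.0),
    ("SYD01", "P1", datetime(2024, 1, 1, 12, 0), 7.5),
]
result = daily_sales(rows)
```

The two `MEL01` sales fall on the same day and collapse into a single entry, whereas grouping on the raw timestamps would keep them as two separate rows.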

In [0]:
summed_data.count()