-sandbox

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Databricks Best Practices

In this notebook, we will explore a wide array of best practices for working with Databricks.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Explore a general framework for debugging slow running jobs
 - Identify the security implications of various data access paradigms
 - Determine various cluster configuration issues including machine types, libraries, and jobs
 - Integrate Databricks notebooks and jobs with version control and the CLI

## Slow Running Jobs

The most common issues with slow running jobs are:<br><br>

- **`Spill`**: Data exceeds the cluster's available memory and spills to disk. Resolution: use a cluster with more memory
- **`Shuffle`**: Large amounts of data are being transferred across the cluster. Resolution: optimize joins or refactor code to avoid shuffles
- **`Skew/Stragglers`**: Partitioned data (in files or in memory) is skewed, causing the "curse of the last reducer", where a few partitions take much longer to process than the rest. Resolution: repartition to a multiple of the available cores or use skew hints
- **`Small/Large Files`**: Too many small files exhaust cluster resources because each file read needs its own thread, while too few large files leave threads unused. Resolution: rewrite the data in a more optimized layout or perform Delta file compaction
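
As a rough illustration of the "repartition to a multiple of the available cores" advice, the sketch below picks a partition count from a data size and a core count. The helper name `target_partitions` and the 128 MB per-partition target are assumptions made for this example, not settings prescribed by this lesson.

```python
import math


def target_partitions(data_size_bytes, total_cores,
                      target_partition_bytes=128 * 1024 * 1024):
    """Return a partition count that is a multiple of total_cores.

    Hypothetical helper: the 128 MB target is an assumed default.
    """
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    # Partitions needed so that each one holds roughly the target size or less.
    needed = max(1, math.ceil(data_size_bytes / target_partition_bytes))
    # Round up to the next multiple of the core count so that every
    # wave of tasks can keep all cores busy.
    return math.ceil(needed / total_cores) * total_cores


# 10 GiB of data on a cluster with 32 worker cores in total.
n = target_partitions(data_size_bytes=10 * 1024 ** 3, total_cores=32)
print(n)
```

On a cluster, the result would typically feed a call such as `df.repartition(n)`.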

Your debugging toolkit:<br><br>

- Ganglia for CPU, network, and memory resources at a cluster or node level
- Spark UI for almost everything else (especially the Storage and Executors tabs)
- Driver or worker logs for errors (especially with background processes)
- Notebook tab of the clusters section to see if the intern is hogging your cluster again

## Data Access and Security

A few notes on data access:<br><br>

* <a href="https://docs.databricks.com/data/databricks-file-system.html#mount-storage" target="_blank">Mount data for easy access</a>
* <a href="https://docs.databricks.com/dev-tools/cli/secrets-cli.html#secrets-cli" target="_blank">Use secrets to secure credentials</a> (this keeps credentials out of the code)
* Credential passthrough works in <a href="https://docs.databricks.com/security/credential-passthrough/iam-passthrough.html" target="_blank">AWS</a> and <a href="https://docs.microsoft.com/en-us/azure/databricks/security/credential-passthrough/adls-passthrough" target="_blank">Azure</a>
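
The first two points can be combined: read a storage credential from a secret scope and use it to mount a container, so the key never appears in the notebook. This is a minimal sketch assuming an Azure Blob Storage container; the function name and every argument value (storage account, container, scope, key) are hypothetical placeholders. `dbutils` is passed in explicitly so the function can be exercised outside a notebook.

```python
def mount_blob_container(dbutils, storage_account, container,
                         mount_point, secret_scope, secret_key):
    """Mount a Blob Storage container using a key held in a secret scope.

    Returns True if a new mount was created, False if it already existed.
    """
    # Skip the mount when the mount point is already present.
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return False

    # The account key comes from the secret scope, not from the code.
    account_key = dbutils.secrets.get(scope=secret_scope, key=secret_key)

    dbutils.fs.mount(
        source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
        mount_point=mount_point,
        extra_configs={
            f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
                account_key
        },
    )
    return True


# In a notebook (placeholder names throughout):
# mount_blob_container(dbutils, "mystorageacct", "raw", "/mnt/raw",
#                      "my-scope", "storage-key")
```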

## Cluster Configuration, Libraries, and Jobs

Cluster types are:<br><br>

- Memory optimized (with or without <a href="https://docs.databricks.com/delta/optimizations/delta-cache.html" target="_blank">Delta Cache Acceleration</a>)
- Compute optimized
- Storage optimized
- GPU accelerated
- General Purpose

General rules of thumb:<br><br>

- Smaller clusters of larger machine types for machine learning
- One cluster per production workload
- Don't share clusters for ML training (even in development)
- <a href="https://docs.databricks.com/clusters/configure.html" target="_blank">See the docs for more specifics</a>
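
One way to apply these rules of thumb is to keep a small, dedicated cluster definition per workload. The sketch below builds a specification using field names from the Clusters API; the helper name, the workload name, the node type, and the runtime version are all placeholder assumptions and should be replaced with values available in your workspace.

```python
import json


def ml_cluster_spec(workload, node_type_id, spark_version, num_workers=2):
    """Build a cluster spec for a single workload (hypothetical helper).

    A low worker count with a large node type follows the
    "smaller clusters of larger machine types" rule for ML.
    """
    return {
        "cluster_name": f"{workload}-cluster",
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Tag the cluster so it is easy to tie back to its one workload.
        "custom_tags": {"workload": workload},
    }


# Placeholder values: pick a node type and runtime your workspace offers.
spec = ml_cluster_spec(
    "churn-training",
    node_type_id="r5.4xlarge",
    spark_version="10.4.x-cpu-ml-scala2.12",
)
print(json.dumps(spec, indent=2))
```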

Library installation best practices:<br><br>
  
- <a href="https://docs.databricks.com/libraries/notebooks-python-libraries.html" target="_blank">Notebook-scoped Python libraries</a> ensure that users on the same cluster can have different libraries. They are also good for saving notebooks together with their library dependencies
- <a href="https://docs.databricks.com/clusters/init-scripts.html" target="_blank">Init scripts</a> ensure that code is run before the JVM starts (good for certain libraries or environment configuration)
- Some configuration variables need to be set on cluster start
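
As a sketch of the init-script approach, the function below writes a small bash script to DBFS that installs Python packages on each node at startup. The function name, the script path, and the package list are assumptions for illustration, and the script still has to be attached to the cluster in its configuration. `dbutils` is passed in explicitly so the function can be exercised outside a notebook.

```python
def write_init_script(dbutils, script_path, packages):
    """Write an init script that pip-installs the given packages.

    Hypothetical helper: returns the script contents that were written.
    """
    lines = ["#!/bin/bash", "set -e"]
    # One install line per package, using the cluster's Python environment.
    lines += [f"/databricks/python/bin/pip install {pkg}" for pkg in packages]
    contents = "\n".join(lines) + "\n"
    # The third argument overwrites any existing file at the same path.
    dbutils.fs.put(script_path, contents, True)
    return contents


# In a notebook (placeholder path and package):
# write_init_script(dbutils, "dbfs:/databricks/scripts/install-libs.sh",
#                   ["nltk==3.7"])
```

For a library that only one notebook needs, a notebook-scoped install is usually the lighter option.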

Jobs best practices:<br><br>

- Use <a href="https://docs.databricks.com/notebooks/notebook-workflows.html" target="_blank">notebook workflows</a>
- <a href="https://docs.databricks.com/notebooks/widgets.html" target="_blank">Widgets</a> work for parameter passing
- You can also run jars and wheels
- Use the CLI for orchestration tools (e.g. Airflow)
- <a href="https://docs.databricks.com/jobs.html" target="_blank">See the docs for more specifics</a>
- Always specify a timeout interval to prevent infinitely running jobs
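
The sketch below ties together notebook workflows, parameter passing, and timeouts: it calls a child notebook with `dbutils.notebook.run`, always passes an explicit timeout, and retries on failure. The wrapper name, the notebook path, and the retry count are assumptions for this example. `dbutils` is passed in explicitly so the function can be exercised outside a notebook.

```python
def run_with_retry(dbutils, notebook_path, timeout_seconds,
                   arguments=None, max_retries=3):
    """Run a child notebook with a timeout, retrying if it fails."""
    # A timeout of 0 means "no timeout" for dbutils.notebook.run,
    # so reject it to avoid runs that never finish.
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")

    last_error = None
    for _ in range(max_retries):
        try:
            # Arguments reach the child notebook as widget values.
            return dbutils.notebook.run(
                notebook_path, timeout_seconds, arguments or {}
            )
        except Exception as error:
            last_error = error
    raise RuntimeError(
        f"{notebook_path} failed after {max_retries} attempts"
    ) from last_error


# In a parent notebook (placeholder path and parameter):
# result = run_with_retry(dbutils, "./etl-step", 600, {"run_date": "2022-01-01"})
# In the child notebook, read the parameter with:
# run_date = dbutils.widgets.get("run_date")
```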

## CLI and Version Control

The <a href="https://github.com/databricks/databricks-cli" target="_blank">Databricks CLI</a>:<br><br>

 * Programmatically export all your notebooks to check into GitHub
 * Can also import/export data, execute jobs, create clusters, and perform most other Workspace tasks
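
As an illustration of the export step, the sketch below wraps the CLI's `workspace export_dir` command so it can be called from a script before committing to Git. It assumes the `databricks` CLI linked above is installed and already configured with credentials; the function names and the paths in the usage comment are placeholders.

```python
import subprocess


def export_dir_command(workspace_path, local_dir, overwrite=False):
    """Build the CLI command that exports a workspace folder to disk."""
    cmd = ["databricks", "workspace", "export_dir", workspace_path, local_dir]
    if overwrite:
        # -o overwrites local files that already exist.
        cmd.append("-o")
    return cmd


def export_notebooks(workspace_path, local_dir, overwrite=False):
    """Run the export; raises CalledProcessError if the CLI fails."""
    subprocess.run(
        export_dir_command(workspace_path, local_dir, overwrite), check=True
    )


# Placeholder paths:
# export_notebooks("/Users/someone@example.com/project", "./notebooks",
#                  overwrite=True)
```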

Git integration can be accomplished in a few ways:<br><br>

 * Use the CLI to import/export notebooks and check them into Git manually
 * <a href="https://docs.databricks.com/notebooks/github-version-control.html" target="_blank">Use the built-in Git integration</a>
 * <a href="https://www.youtube.com/watch?v=HsfMmBfQtvI" target="_blank">Use the next generation workspace for alternative project integration</a>

Time permitting: explore the <a href="https://docs.databricks.com/administration-guide/index.html" target="_blank">admin console!</a>

-sandbox
&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>