### Databricks Notebooks
- Support Scala, Python, SQL, R, and Markdown
- Version control 
- Connect to remote Git Repository via Repos
- Permissions for sharing notebooks
- Schedule as production jobs

### Working with Notebooks
- Create new notebook and set default language

<img src = "https://storage.googleapis.com/databricks-public-images/bmathew/create_notebook.png" height=600 width=600>

<br>
- Import a notebook

<img src = "https://storage.googleapis.com/databricks-public-images/bmathew/import_notebook.png" height=300 width=300>

<br>
- Notebooks in Repos

<img src = "https://storage.googleapis.com/databricks-public-images/bmathew/repos_img1.png" height=400 width=400>

### Attach the notebook to a cluster

### Write Scala, Python, SQL, R, and Markdown into the notebook cells
- Streaming + Batch ETL code, SQL, Machine Learning, Deep Learning, and Graph Analytics
- Java is not directly supported in notebooks; to run Java code, submit Java JAR files via the REST API
<br><br>
- The example below uses Python to read a Parquet file from Google Cloud Storage into a Spark DataFrame
- Learn more about DataFrames: https://docs.gcp.databricks.com/getting-started/spark/dataframes.html
- When you first create a notebook, you will set the default language
   - To use another language, specify the interpreter on the first line of the cell, e.g. %python, %scala, %sql, %r, or %md
- Up to 1000 rows will be displayed in tabular format
   - Can download up to 1 million rows, or you can disable downloading of results for users

In [0]:
%python
df = spark.read.parquet("gs://databricksgcplabs/sales_data")
display(df)

ORDERNUMBER,QUANTITYORDERED,PRICEEACH,ORDERLINENUMBER,SALES,ORDERDATE,STATUS,QTR_ID,MONTH_ID,YEAR_ID,PRODUCTLINE,MSRP,PRODUCTCODE,CUSTOMERNAME,PHONE,ADDRESSLINE1,ADDRESSLINE2,CITY,STATE,POSTALCODE,COUNTRY,TERRITORY,CONTACTLASTNAME,CONTACTFIRSTNAME,DEALSIZE
10107,30,95.7,2,2871.0,2/24/2003 0:00,Shipped,1,2,2003,Motorcycles,95,S10_1678,Land of Toys Inc.,2125557818,897 Long Airport Avenue,,NYC,NY,10022,USA,,Yu,Kwai,Small
10121,34,81.35,5,2765.9,5/7/2003 0:00,Shipped,2,5,2003,Motorcycles,95,S10_1678,Reims Collectables,26.47.1555,59 rue de l'Abbaye,,Reims,,51100,France,EMEA,Henriot,Paul,Small
10134,41,94.74,2,3884.34,7/1/2003 0:00,Shipped,3,7,2003,Motorcycles,95,S10_1678,Lyon Souveniers,+33 1 46 62 7555,27 rue du Colonel Pierre Avia,,Paris,,75508,France,EMEA,Da Cunha,Daniel,Medium
10145,45,83.26,6,3746.7,8/25/2003 0:00,Shipped,3,8,2003,Motorcycles,95,S10_1678,Toys4GrownUps.com,6265557265,78934 Hillside Dr.,,Pasadena,CA,90003,USA,,Young,Julie,Medium
10159,49,100.0,14,5205.27,10/10/2003 0:00,Shipped,4,10,2003,Motorcycles,95,S10_1678,Corporate Gift Ideas Co.,6505551386,7734 Strong St.,,San Francisco,CA,,USA,,Brown,Julie,Medium
10168,36,96.66,1,3479.76,10/28/2003 0:00,Shipped,4,10,2003,Motorcycles,95,S10_1678,Technics Stores Inc.,6505556809,9408 Furth Circle,,Burlingame,CA,94217,USA,,Hirano,Juri,Medium
10180,29,86.13,9,2497.77,11/11/2003 0:00,Shipped,4,11,2003,Motorcycles,95,S10_1678,Daedalus Designs Imports,20.16.1555,"184, chausse de Tournai",,Lille,,59000,France,EMEA,Rance,Martine,Small
10188,48,100.0,1,5512.32,11/18/2003 0:00,Shipped,4,11,2003,Motorcycles,95,S10_1678,Herkku Gifts,+47 2267 3215,"Drammen 121, PR 744 Sentrum",,Bergen,,N 5804,Norway,EMEA,Oeztan,Veysel,Medium
10201,22,98.57,2,2168.54,12/1/2003 0:00,Shipped,4,12,2003,Motorcycles,95,S10_1678,Mini Wheels Co.,6505555787,5557 North Pendale Street,,San Francisco,CA,,USA,,Murphy,Julie,Small
10211,41,100.0,14,4708.44,1/15/2004 0:00,Shipped,1,1,2004,Motorcycles,95,S10_1678,Auto Canal Petit,(1) 47.55.6555,"25, rue Lauriston",,Paris,,75016,France,EMEA,Perrier,Dominique,Medium


- <b>Spark development tip:</b> Cache this DataFrame to memory for faster access when reused in subsequent operations
- Notice that the Python language interpreter is not set. That's because the default language for this notebook is already set to Python.

In [0]:
df.cache()

- Create a temporary view from the cached DataFrame so that we can easily run SQL commands against it
- We will learn to create a materialized table in the last lab exercise

In [0]:
df.createOrReplaceTempView("sales_data")

### SQL 
- ANSI SQL Syntax you are already familiar with
- Aggregations, Filtering, Grouping, Sorting, Analytical Functions, Joins, etc.
- Since we are switching from the default language, Python, we must set the SQL language interpreter (%sql) in the first line
- Can perform SQL on materialized tables, views, or in-memory objects

In [0]:
%sql
SELECT productline AS `Product Line`,
concat('$',format_number(sum(sales),2)) AS `Total Sales Revenue`
FROM sales_data WHERE lower(status) = 'shipped' GROUP BY 1 ORDER BY sum(sales) ASC

Product Line,Total Sales Revenue
Trains,"$215,352.57"
Ships,"$591,172.76"
Planes,"$866,466.57"
Trucks and Buses,"$1,044,097.39"
Motorcycles,"$1,129,573.83"
Vintage Cars,"$1,743,077.63"
Classic Cars,"$3,701,760.33"


Output can only be rendered in Databricks

### Visualize the data
- Databricks supports many different types of visualizations to create rich dashboards.
- Connect your BI tools to Databricks via JDBC/ODBC to analyze the data.
- Create your own custom visualizations: HTML, JavaScript, D3, SVG, Matplotlib, ggplot2

In [0]:
%sql
SELECT productline AS `Product Line`,
sum(sales) AS `Total Sales Revenue`
FROM sales_data WHERE lower(status) = 'shipped' GROUP BY 1 ORDER BY sum(sales) ASC

Product Line,Total Sales Revenue
Trains,215352.56999999995
Ships,591172.7599999999
Planes,866466.5700000001
Trucks and Buses,1044097.3899999998
Motorcycles,1129573.8300000003
Vintage Cars,1743077.6299999992
Classic Cars,3701760.329999997
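
The raw sums above carry floating-point noise (for example `215352.56999999995`), which is what the earlier query hid by wrapping `sum(sales)` in `format_number` and prepending `$`. As a rough Python analogue of that formatting step (a sketch only, not Spark's implementation; `format_usd` is a hypothetical helper name):

```python
def format_usd(amount):
    """Format a number with thousands separators and 2 decimal places, prefixed with '$'."""
    return f"${amount:,.2f}"

# Two of the raw totals copied from the output above
raw_totals = {
    "Trains": 215352.56999999995,
    "Classic Cars": 3701760.329999997,
}

for product_line, total in raw_totals.items():
    print(product_line, format_usd(total))
```

Keeping the numeric column unformatted, as this query does, lets the visualization treat the values as numbers; formatting is best applied only for display.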



### User Defined Functions

- Create function and register as User Defined Function (UDF) to call from SparkSQL or DataFrames
- This example takes 2 input parameters, price and tax%, to calculate total price with tax

In [0]:
import decimal

def price_with_tax_local(price, tax):
    # tax is a percentage, e.g. 6.70 means 6.70%
    price = decimal.Decimal(price)
    tax = decimal.Decimal(tax) / 100
    return price + (price * tax)

# Register the function so it can be called from Spark SQL
spark.udf.register("price_with_tax_local", price_with_tax_local)

- Call function from SQL query to use against the data in a table

In [0]:
%sql
SELECT ordernumber, orderdate, sales, price_with_tax_local(sales, 6.70) AS sales_amount_plus_tax FROM sales_data

ordernumber,orderdate,sales,sales_amount_plus_tax
10107,2/24/2003 0:00,2871.0,3063.357
10121,5/7/2003 0:00,2765.9,2951.2153
10134,7/1/2003 0:00,3884.34,4144.59078
10145,8/25/2003 0:00,3746.7,3997.7289
10159,10/10/2003 0:00,5205.27,5554.02309
10168,10/28/2003 0:00,3479.76,3712.90392
10180,11/11/2003 0:00,2497.77,2665.12059
10188,11/18/2003 0:00,5512.32,5881.64544
10201,12/1/2003 0:00,2168.54,2313.83218
10211,1/15/2004 0:00,4708.44,5023.90548
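
The first row of the output above can be checked in plain Python, without a cluster. This is a minimal sketch that repeats the same arithmetic as a standalone function; passing the tax as a `Decimal` is an assumption that mirrors how Spark SQL treats the literal `6.70`.

```python
from decimal import Decimal

def price_with_tax_local(price, tax):
    """Return price plus tax, where tax is a percentage such as 6.70."""
    price = Decimal(price)
    tax = Decimal(tax) / 100
    return price + (price * tax)

# Order 10107 has sales of 2871.0; apply a 6.70% tax
result = price_with_tax_local(2871.0, Decimal("6.70"))
print(result)
```

Testing the function locally like this is a cheap way to catch arithmetic mistakes before registering it as a UDF.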


- Create custom user defined functions available to everyone as Java, Scala, or Python libraries
   - In this example, I created a Python Egg file containing the package my_custom_functions, which provides various functions
   - I created the Egg file locally on my laptop and uploaded it to Databricks. You can upload using the GUI, API, or CLI
   - I will import the library into Databricks, attach it to the cluster, and then run the cell
   - Similar to the UDF above, but the function is now imported from the packaged library instead of being defined in the notebook
   
- UDFs are executed row by row. When working with very large datasets, use vectorized UDFs with Python pandas instead: https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html

In [0]:
from my_custom_functions import price_with_tax

df4 = spark.sql("SELECT ordernumber, orderdate, sales FROM sales_data")
output = (
    df4.rdd
    .map(lambda line: (line[0], line[1], line[2], price_with_tax(line[2], 6.70)))
    .toDF(["ordernumber", "orderdate", "sales", "sales_amount_plus_tax"])
)
display(output)

ordernumber,orderdate,sales,sales_amount_plus_tax
10107,2/24/2003 0:00,2871.0,3063.357
10121,5/7/2003 0:00,2765.9,2951.2153000000003
10134,7/1/2003 0:00,3884.34,4144.59078
10145,8/25/2003 0:00,3746.7,3997.7289
10159,10/10/2003 0:00,5205.27,5554.023090000001
10168,10/28/2003 0:00,3479.76,3712.90392
10180,11/11/2003 0:00,2497.77,2665.12059
10188,11/18/2003 0:00,5512.32,5881.645439999999
10201,12/1/2003 0:00,2168.54,2313.83218
10211,1/15/2004 0:00,4708.44,5023.905479999999
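
The vectorized alternative mentioned earlier could look roughly like the sketch below. It assumes pandas is available; the names `TAX_PERCENT`, `price_with_tax_vectorized`, and `register_vectorized_udf` are hypothetical and not part of this lab. The Spark wrapping is placed in a helper that is not called here, since it needs a running Spark session (with PyArrow installed) to execute.

```python
import pandas as pd

TAX_PERCENT = 6.70  # assumed fixed rate, matching the examples above

def price_with_tax_vectorized(price: pd.Series) -> pd.Series:
    """Add tax to a whole batch of prices in one vectorized operation."""
    return price + price * (TAX_PERCENT / 100)

def register_vectorized_udf(spark):
    """Hypothetical helper: wrap the function as a pandas UDF and register it for SQL."""
    # Imported lazily so the rest of this sketch runs without Spark installed
    from pyspark.sql.functions import pandas_udf
    vectorized = pandas_udf(price_with_tax_vectorized, returnType="double")
    spark.udf.register("price_with_tax_vectorized", vectorized)

# Local check on a small batch of sales values taken from the table above
batch = pd.Series([2871.0, 2765.9])
print(price_with_tax_vectorized(batch))
```

Because the function receives a whole pandas Series per batch rather than one value per call, the per-row Python overhead of the `rdd.map` approach is avoided.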



### Now that you understand how to develop code using notebooks, let's learn how to connect to data in Google Cloud Storage!

#### [Click here to return to agenda]($./Agenda)