#### Please read the text in each of the notebook cells. <BR>There are instructions in some of them that require input from you so that the code will run as expected.

### Data Sources
- Databricks can connect to source data in a variety of different storage media and formats:
   - RDBMS
   - Message Bus (e.g. Pub/Sub, Kafka, and more)
   - NoSQL
   - Files (Delimited text, JSON, Avro, Parquet, ORC, Sequence Files, Images, Documents, and more)
   - Google Cloud Services: BigQuery, Bigtable, Cloud SQL, Pub/Sub, and other GCP services
   - Sources can be in the cloud or on-premises, but Databricks is optimized for cloud object storage such as Google Cloud Storage
- Connecting to Data Sources: https://docs.gcp.databricks.com/data/data-sources/index.html

### Authenticating to Google Cloud Storage in Databricks
Use a Service Account 
- Secure and does not expose any sensitive information
- Create a Service Account in Google Cloud, give that Service Account the proper permissions, grant it access to the Google Cloud Storage bucket, <BR>and specify the Service Account when launching the cluster.
- Instead of associating a Service Account with a cluster, you can also add the Service Account key values to spark.conf.
- Visit the online documentation to learn more about how to use Service Accounts with Databricks: https://docs.gcp.databricks.com/data/data-sources/google/gcs.html
  
<img src = "https://storage.googleapis.com/databricks-public-images/bmathew/service_account.png" height=600 width=600>
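The spark.conf alternative mentioned above can be sketched as a small helper that maps the fields of a service account JSON key onto Spark configuration entries. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper name `gcs_spark_conf` is hypothetical, and the property names are assumed to match the ones listed in the Databricks GCS documentation linked above, so confirm them on that page before relying on them.

```python
def gcs_spark_conf(key: dict) -> dict:
    """Map a parsed service account JSON key to Spark configuration entries.

    Assumption: the property names below follow the Databricks GCS
    documentation; verify them there before use.
    """
    required = ("client_email", "project_id", "private_key", "private_key_id")
    missing = [field for field in required if field not in key]
    if missing:
        raise KeyError(f"service account key is missing fields: {missing}")
    return {
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.fs.gs.auth.service.account.email": key["client_email"],
        "spark.hadoop.fs.gs.project.id": key["project_id"],
        "spark.hadoop.fs.gs.auth.service.account.private.key": key["private_key"],
        "spark.hadoop.fs.gs.auth.service.account.private.key.id": key["private_key_id"],
    }
```

The private key values are sensitive, so in practice they would normally be read from a secret store rather than pasted into a notebook cell.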

### Accessing Google Cloud Storage in Databricks
There are two common ways to access data in Google Cloud Storage from Databricks.
1. <b>Mount points</b>
   - Mount your storage bucket to the Databricks workspace so that it is shared by all users and clusters.
   - This works well in a development environment where all users might need access to the data.
   - The mount is only a pointer to a GCS location and the data is never synced locally.
2. <b>No mount points</b>
   - Access the bucket directly by its gs:// path, without a mount point, using a specific service account that you have access to.

Both methods require the use of a Service Account.

### Using Mount Points
- Create a service account for the Databricks cluster.
   - The service account must be in the Google Cloud project that you used to set up the Databricks workspace.
- Configure the GCS bucket so that the service account has access to it.
- Launch a Databricks cluster with a service account attached to it.
- Note: For demonstration purposes, the example below includes the user ID in the mount name; this would not typically be done in a live setting.
   - Make certain to replace the parameter "your-username" with your unique username (the cells in this notebook show odl_user_768559).
   - For example: odl_instructor_490106@databrickslabs.com --> odl_instructor_490106

In [0]:
%python
dbutils.fs.mount("gs://databricksgcplabs","/mnt/odl_user_768559/databricksgcplabs")
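The username substitution described above can be sketched as a small helper that derives the username from a login email and builds the per-user mount point. The function name `mount_point_for` is hypothetical and only illustrates the naming pattern used in this notebook.

```python
def mount_point_for(email: str, bucket: str) -> str:
    """Build a per-user mount point of the form /mnt/<username>/<bucket>."""
    # The username is everything before the first "@" in the login email.
    username = email.split("@", 1)[0]
    if not username:
        raise ValueError(f"cannot derive a username from {email!r}")
    return f"/mnt/{username}/{bucket}"


print(mount_point_for("odl_instructor_490106@databrickslabs.com", "databricksgcplabs"))
# prints /mnt/odl_instructor_490106/databricksgcplabs
```

The returned path can then be passed as the second argument to dbutils.fs.mount.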

- Once mounted, we can view and navigate the contents of our GCS bucket using Databricks %fs file system commands.
- Databricks file system commands use Unix-style syntax (e.g. ls, head, cp, rm, and mkdirs; the Unix equivalent of mkdirs is mkdir)
- Replace the parameter "your-username" with your unique username

In [0]:
%fs
ls /mnt/odl_user_768559/databricksgcplabs

path,name,size
dbfs:/mnt/odl_user_768559/databricksgcplabs/atm_dataset/,atm_dataset/,0
dbfs:/mnt/odl_user_768559/databricksgcplabs/clickstream-json/,clickstream-json/,0
dbfs:/mnt/odl_user_768559/databricksgcplabs/members/,members/,0
dbfs:/mnt/odl_user_768559/databricksgcplabs/merge-data/,merge-data/,0
dbfs:/mnt/odl_user_768559/databricksgcplabs/my_custom_functions-0.1-py3.6_2022.egg,my_custom_functions-0.1-py3.6_2022.egg,2029
dbfs:/mnt/odl_user_768559/databricksgcplabs/products/,products/,0
dbfs:/mnt/odl_user_768559/databricksgcplabs/sales_data/,sales_data/,0
dbfs:/mnt/odl_user_768559/databricksgcplabs/tmp/,tmp/,0
dbfs:/mnt/odl_user_768559/databricksgcplabs/tmp1/,tmp1/,0
dbfs:/mnt/odl_user_768559/databricksgcplabs/transactions/,transactions/,0


- Unmount the mount point
- Replace the parameter "your-username" with your unique username

In [0]:
%python
dbutils.fs.unmount("/mnt/odl_user_768559/databricksgcplabs")
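Re-running a mount cell typically fails if the mount point is already in use, and unmounting a path that is not mounted fails as well. A minimal sketch of guarding both calls is shown below. It relies on dbutils.fs.mounts(), which lists the current mounts; the helper names are hypothetical, and `dbutils` is passed in as a parameter so the functions can be exercised outside a notebook.

```python
def mount_if_absent(dbutils, source: str, mount_point: str) -> bool:
    """Mount source at mount_point unless it is already mounted.

    Returns True if a new mount was created, False if one already existed.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(source, mount_point)
    return True


def unmount_if_present(dbutils, mount_point: str) -> bool:
    """Unmount mount_point only if it is currently mounted.

    Returns True if a mount was removed, False if there was nothing to remove.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point not in existing:
        return False
    dbutils.fs.unmount(mount_point)
    return True
```

In a notebook cell this would be called as mount_if_absent(dbutils, "gs://databricksgcplabs", "/mnt/odl_user_768559/databricksgcplabs"), using the notebook's built-in dbutils object.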

### No Mount Points
- Create a service account for the Databricks cluster.
   - The service account must be in the Google Cloud project that you used to set up the Databricks workspace.
- Configure the GCS bucket so that the service account has access to it.
- Launch a Databricks cluster with a service account attached to it.

- Use Databricks %fs file system commands directly against the GCS bucket

In [0]:
%fs
ls gs://databricksgcplabs

path,name,size
gs://databricksgcplabs/tmp/,tmp/,0
gs://databricksgcplabs/tmp1/,tmp1/,0
gs://databricksgcplabs/members/,members/,0
gs://databricksgcplabs/products/,products/,0
gs://databricksgcplabs/user_logs/,user_logs/,0
gs://databricksgcplabs/merge-data/,merge-data/,0
gs://databricksgcplabs/sales_data/,sales_data/,0
gs://databricksgcplabs/atm_dataset/,atm_dataset/,0
gs://databricksgcplabs/transactions/,transactions/,0
gs://databricksgcplabs/clickstream-json/,clickstream-json/,0


In [0]:
%fs
head gs://databricksgcplabs/members/members.csv
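The %fs listings above also have a Python counterpart: dbutils.fs.ls returns entries whose path, name, and size attributes correspond to the columns shown in the output. As a minimal sketch, the hypothetical helper below collects the directory names under a path, assuming (as in the listings above) that directory names end with a trailing slash.

```python
def list_subdirectories(dbutils, path: str) -> list:
    """Return the sorted names of the directories directly under path."""
    return sorted(
        entry.name.rstrip("/")
        for entry in dbutils.fs.ls(path)
        # Directory entries are listed with a trailing "/" in their name.
        if entry.name.endswith("/")
    )
```

In a notebook, list_subdirectories(dbutils, "gs://databricksgcplabs") would return the folder names from the listing as a plain Python list that can be looped over.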

- To run multiple commands in the same cell, use Databricks Utilities (DBUtils), which executes file system commands.
- Here is an example that creates a new directory, copies files to it, and then removes the directory and its contents.
- The example below does a recursive copy so that any subdirectories are also copied.
- Replace the parameter "your-username" with your unique username

In [0]:
%python
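# Create a scratch directory
dbutils.fs.mkdirs("/tmp/odl_user_768559/data")
# Recursively copy the products folder from the GCS bucket into it
dbutils.fs.cp("gs://databricksgcplabs/products","/tmp/odl_user_768559/data", recurse=True)
# Remove the scratch directory and everything in it
dbutils.fs.rm("/tmp/odl_user_768559/data", recurse=True)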
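If the copy step fails partway through, the cell above leaves the scratch directory behind. One way to make the create, use, and remove sequence more robust is a context manager that always removes the directory on exit. This is a sketch under stated assumptions rather than a Databricks feature: `scratch_directory` is a hypothetical helper, and `dbutils` is passed in as a parameter.

```python
from contextlib import contextmanager


@contextmanager
def scratch_directory(dbutils, path: str):
    """Create path, hand it to the caller, and remove it recursively on exit."""
    dbutils.fs.mkdirs(path)
    try:
        yield path
    finally:
        # Runs whether or not the body of the with-block raised an error.
        dbutils.fs.rm(path, recurse=True)


# Usage inside a notebook (commented out because dbutils only exists there):
# with scratch_directory(dbutils, "/tmp/odl_user_768559/data") as target:
#     dbutils.fs.cp("gs://databricksgcplabs/products", target, recurse=True)
```

The try/finally block is the design choice that matters here: cleanup is tied to leaving the with-block rather than to reaching the last line of the cell.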

### Now that you have learned how to authenticate to and access data in GCS, let's complete the final lab and build a Data Lakehouse!

#### [Click here to return to agenda]($./Agenda)