This example notebook closely follows the [Databricks documentation](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake.html) for how to set up Azure Data Lake Store as a data source in Databricks.

## 0 - Setup

To get set up, do these tasks first: 

- Get service credentials: Client ID `<aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee>` and Client Credential `<NzQzY2QzYTAtM2I3Zi00NzFmLWI3MGMtMzc4MzRjZmk=>`. Follow the instructions in [Create service principal with portal](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal). 
- Get directory ID `<ffffffff-gggg-hhhh-iiii-jjjjjjjjjjjj>`: This is also referred to as *tenant ID*. Follow the instructions in [Get tenant ID](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal#get-tenant-id). 
- If you haven't set up the service app, follow this [tutorial](https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse). Grant access at the root directory or at the desired folder level, either to the service app or to everyone.

There are two options to read and write Azure Data Lake data from Azure Databricks:
1. DBFS mount points
2. Spark configs

## 1 - DBFS mount points
[DBFS](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html) mount points let you mount Azure Data Lake Store for all users in the workspace. Once mounted, the data can be accessed directly via a DBFS path from any cluster, without providing credentials each time. The example below sets up a mount point; note that it uses a `wasbs://` (Azure Blob Storage) source authenticated with a storage account key rather than an `adl://` source.

In [0]:
dbutils.fs.unmount('/mnt/lake')

In [0]:
dbutils.fs.mount(
  source = "wasbs://container1@projectaccount123.blob.core.windows.net",
  mount_point = '/mnt/lake',
  extra_configs = {
    # Read the storage account key from a secret scope instead of hardcoding it.
    # "<scope-name>" and "<key-name>" are placeholders for your own secret scope and key.
    'fs.azure.account.key.projectaccount123.blob.core.windows.net': dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")
  }
)
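
The unmount and mount cells above run unconditionally, and they can fail when the mount point is absent (for `unmount`) or already present (for `mount`). A minimal sketch of a guarded mount follows. `ensure_mounted` is a hypothetical helper name, and `dbutils` is passed in as an argument only so the function can be exercised outside a notebook; it relies on `dbutils.fs.mounts()` returning entries with a `mountPoint` attribute.

```python
def ensure_mounted(dbutils, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless something is already mounted there.

    `ensure_mounted` is an illustrative helper, not part of dbutils.
    Returns True if a new mount was created, False if one already existed.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    return True
```

Inside a notebook, the call would take the built-in `dbutils` object together with the same `source`, `mount_point`, and `extra_configs` values used in the mount cell.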

In [0]:
dbutils.fs.ls('/mnt/lake')

In [0]:
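# PySpark is already included in the Databricks Runtime; this install is only needed outside Databricks.
!pip install pyspark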

In [0]:
import pandas as pd
df = pd.read_csv('/dbfs/mnt/lake/superstore.csv')
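
Note that pandas reads the file through the local `/dbfs/...` path, while `dbutils.fs` and Spark address the same file as `/mnt/lake/...` or `dbfs:/mnt/lake/...`. The sketch below shows that mapping in plain Python; `to_local_path` is a hypothetical helper name introduced here for illustration.

```python
def to_local_path(dbfs_path):
    """Map 'dbfs:/mnt/x' or '/mnt/x' to the local '/dbfs/mnt/x' form.

    Illustrative helper: local file APIs such as pandas need the '/dbfs' prefix.
    """
    path = dbfs_path
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    if not path.startswith("/"):
        path = "/" + path
    return "/dbfs" + path
```

For example, `pd.read_csv(to_local_path("dbfs:/mnt/lake/superstore.csv"))` would read the same file as the cell above.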

In [0]:
df.info()

In [0]:
df.head()

Unnamed: 0,Order ID,Ship Mode,Segment,Region,Product ID,Sales,Quantity,Discount,Profit
0,CA-2016-152156,Second Class,Consumer,South,FUR-BO-10001798,261.96,2,0%,41.9136
1,CA-2016-152156,Second Class,Consumer,South,FUR-CH-10000454,731.94,3,0%,219.582
2,CA-2016-138688,Second Class,Corporate,West,OFF-LA-10000240,14.62,2,0%,6.8714
3,US-2015-108966,Standard Class,Consumer,South,FUR-TA-10000577,957.5775,5,0.45%,-383.031
4,US-2015-108966,Standard Class,Consumer,South,OFF-ST-10000760,22.368,2,0.20%,2.5164


In [0]:
df['Discount'] = df['Discount'].str.rstrip("%").astype(float)/100
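
The cell above strips the trailing `%` from each value and divides by 100, so `0.45%` becomes `0.0045`. The same per-value logic in plain Python, as a sketch for spot-checking individual values; `percent_to_fraction` is a hypothetical helper name.

```python
def percent_to_fraction(text):
    """Convert a percentage string such as '0.45%' to a fraction of 1."""
    # rstrip("%") removes trailing percent signs, mirroring .str.rstrip("%") in pandas.
    return float(text.strip().rstrip("%")) / 100
```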

In [0]:
df.head()

Unnamed: 0,Order ID,Ship Mode,Segment,Region,Product ID,Sales,Quantity,Discount,Profit
0,CA-2016-152156,Second Class,Consumer,South,FUR-BO-10001798,261.96,2,0.0,41.9136
1,CA-2016-152156,Second Class,Consumer,South,FUR-CH-10000454,731.94,3,0.0,219.582
2,CA-2016-138688,Second Class,Corporate,West,OFF-LA-10000240,14.62,2,0.0,6.8714
3,US-2015-108966,Standard Class,Consumer,South,FUR-TA-10000577,957.5775,5,0.0045,-383.031
4,US-2015-108966,Standard Class,Consumer,South,OFF-ST-10000760,22.368,2,0.002,2.5164


In [0]:
df['Order ID'] = df['Order ID'].astype('string')
df.head()

Unnamed: 0,Order ID,Ship Mode,Segment,Region,Product ID,Sales,Quantity,Discount,Profit
0,CA-2016-152156,Second Class,Consumer,South,FUR-BO-10001798,261.96,2,0.0,41.9136
1,CA-2016-152156,Second Class,Consumer,South,FUR-CH-10000454,731.94,3,0.0,219.582
2,CA-2016-138688,Second Class,Corporate,West,OFF-LA-10000240,14.62,2,0.0,6.8714
3,US-2015-108966,Standard Class,Consumer,South,FUR-TA-10000577,957.5775,5,0.0045,-383.031
4,US-2015-108966,Standard Class,Consumer,South,OFF-ST-10000760,22.368,2,0.002,2.5164


In [0]:
df.info()

In [0]:
df['Ship Mode'] = df['Ship Mode'].astype('string')
df.info()

In [0]:
df['Segment'] = df['Segment'].astype('string')
df.head()

Unnamed: 0,Order ID,Ship Mode,Segment,Region,Product ID,Sales,Quantity,Discount,Profit
0,CA-2016-152156,Second Class,Consumer,South,FUR-BO-10001798,261.96,2,0.0,41.9136
1,CA-2016-152156,Second Class,Consumer,South,FUR-CH-10000454,731.94,3,0.0,219.582
2,CA-2016-138688,Second Class,Corporate,West,OFF-LA-10000240,14.62,2,0.0,6.8714
3,US-2015-108966,Standard Class,Consumer,South,FUR-TA-10000577,957.5775,5,0.0045,-383.031
4,US-2015-108966,Standard Class,Consumer,South,OFF-ST-10000760,22.368,2,0.002,2.5164


In [0]:
df.info()

In [0]:
df['Region'] = df['Region'].astype('string')
df.head()

Unnamed: 0,Order ID,Ship Mode,Segment,Region,Product ID,Sales,Quantity,Discount,Profit
0,CA-2016-152156,Second Class,Consumer,South,FUR-BO-10001798,261.96,2,0.0,41.9136
1,CA-2016-152156,Second Class,Consumer,South,FUR-CH-10000454,731.94,3,0.0,219.582
2,CA-2016-138688,Second Class,Corporate,West,OFF-LA-10000240,14.62,2,0.0,6.8714
3,US-2015-108966,Standard Class,Consumer,South,FUR-TA-10000577,957.5775,5,0.0045,-383.031
4,US-2015-108966,Standard Class,Consumer,South,OFF-ST-10000760,22.368,2,0.002,2.5164


In [0]:
df.info()

In [0]:
df['Product ID'] = df['Product ID'].astype('string')
df.head()

Unnamed: 0,Order ID,Ship Mode,Segment,Region,Product ID,Sales,Quantity,Discount,Profit
0,CA-2016-152156,Second Class,Consumer,South,FUR-BO-10001798,261.96,2,0.0,41.9136
1,CA-2016-152156,Second Class,Consumer,South,FUR-CH-10000454,731.94,3,0.0,219.582
2,CA-2016-138688,Second Class,Corporate,West,OFF-LA-10000240,14.62,2,0.0,6.8714
3,US-2015-108966,Standard Class,Consumer,South,FUR-TA-10000577,957.5775,5,0.0045,-383.031
4,US-2015-108966,Standard Class,Consumer,South,OFF-ST-10000760,22.368,2,0.002,2.5164


In [0]:
df.info()

In [0]:
df = df.drop_duplicates()  # assign the result; drop_duplicates does not modify df in place
df

Unnamed: 0,Order ID,Ship Mode,Segment,Region,Product ID,Sales,Quantity,Discount,Profit
0,CA-2016-152156,Second Class,Consumer,South,FUR-BO-10001798,261.9600,2,0.0000,41.9136
1,CA-2016-152156,Second Class,Consumer,South,FUR-CH-10000454,731.9400,3,0.0000,219.5820
2,CA-2016-138688,Second Class,Corporate,West,OFF-LA-10000240,14.6200,2,0.0000,6.8714
3,US-2015-108966,Standard Class,Consumer,South,FUR-TA-10000577,957.5775,5,0.0045,-383.0310
4,US-2015-108966,Standard Class,Consumer,South,OFF-ST-10000760,22.3680,2,0.0020,2.5164
...,...,...,...,...,...,...,...,...,...
9989,CA-2014-110422,Second Class,Consumer,South,FUR-FU-10001889,25.2480,3,0.0020,4.1028
9990,CA-2017-121258,Standard Class,Consumer,West,FUR-FU-10000747,91.9600,2,0.0000,15.6332
9991,CA-2017-121258,Standard Class,Consumer,West,TEC-PH-10003645,258.5760,2,0.0020,19.3932
9992,CA-2017-121258,Standard Class,Consumer,West,OFF-PA-10004041,29.6000,4,0.0000,13.3200


In [0]:
dbutils.fs.unmount('/mnt/lake1')

In [0]:
dbutils.fs.mount(
  source = "wasbs://silver@projectaccount123.blob.core.windows.net",
  mount_point = '/mnt/lake1',
  extra_configs = {
    # Read the storage account key from a secret scope instead of hardcoding it.
    # "<scope-name>" and "<key-name>" are placeholders for your own secret scope and key.
    'fs.azure.account.key.projectaccount123.blob.core.windows.net': dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")
  }
)
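
Both mount cells in this section follow the same pattern: a `wasbs://<container>@<account>.blob.core.windows.net` source and an `fs.azure.account.key.<account>.blob.core.windows.net` configuration key. A small sketch that builds both strings from the container and account names, so they cannot drift apart; the function names are hypothetical.

```python
def wasbs_source(container, account):
    """Return the wasbs:// URL for a container in a storage account."""
    return f"wasbs://{container}@{account}.blob.core.windows.net"


def account_key_setting(account):
    """Return the extra_configs key under which the account key is supplied."""
    return f"fs.azure.account.key.{account}.blob.core.windows.net"
```

With these, the `silver` mount above would use `wasbs_source("silver", "projectaccount123")` as its source and `account_key_setting("projectaccount123")` as the key in `extra_configs`.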

In [0]:
df = spark.createDataFrame(df)

In [0]:
df.coalesce(1).write.format("csv").option("header", "true").mode("overwrite").save("/mnt/lake1/testoutput.csv")
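
Spark treats the save path as a directory, so `testoutput.csv` ends up as a folder containing one `part-*.csv` file (because of `coalesce(1)`) alongside marker files. The sketch below picks that part file out of a directory listing. `find_single_csv_part` is a hypothetical helper, and it assumes entries shaped like the results of `dbutils.fs.ls`, with `name` and `path` attributes.

```python
def find_single_csv_part(entries):
    """Return the path of the only part-*.csv file among directory entries.

    `entries` is assumed to resemble the output of dbutils.fs.ls(...):
    objects exposing `name` and `path`.
    """
    parts = [
        e.path
        for e in entries
        if e.name.startswith("part-") and e.name.endswith(".csv")
    ]
    if len(parts) != 1:
        raise ValueError(f"expected exactly one CSV part file, found {len(parts)}")
    return parts[0]
```

In a notebook this could be called as `find_single_csv_part(dbutils.fs.ls("/mnt/lake1/testoutput.csv"))`, and the returned path could then be copied or renamed with `dbutils.fs.cp` or `dbutils.fs.mv`.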

In [0]:
dbutils.fs.ls('/mnt/lake1')

## 2 - Spark configs

With Spark configs, the Azure Data Lake Store settings can be specified per notebook. To keep things simple, the example below includes the credentials in plaintext. However, we strongly discourage you from storing secrets in plaintext. Instead, we recommend storing the credentials as [Databricks Secrets](https://docs.azuredatabricks.net/user-guide/secrets/index.html#secrets-user-guide).

**Note:** `spark.conf` values are visible only to the Dataset and DataFrame APIs. If you need access to them from an RDD, refer to the [documentation](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake.html#access-azure-data-lake-store-using-the-rdd-api).

In [0]:
%scala
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "<aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee>")
spark.conf.set("dfs.adls.oauth2.credential", "<NzQzY2QzYTAtM2I3Zi00NzFmLWI3MGMtMzc4MzRjZmk=>")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<ffffffff-gggg-hhhh-iiii-jjjjjjjjjjjj>/oauth2/token")
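
A Python counterpart of the Scala cell above, sketched so that the credentials can come from Databricks Secrets rather than plaintext. The two function names are hypothetical; the four `dfs.adls.oauth2.*` keys and the refresh URL format are taken from the cell above.

```python
def adls_oauth_settings(client_id, credential, tenant_id):
    """Build the four dfs.adls.oauth2.* settings for service-principal access."""
    return {
        "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        "dfs.adls.oauth2.client.id": client_id,
        "dfs.adls.oauth2.credential": credential,
        "dfs.adls.oauth2.refresh.url":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def apply_settings(conf, settings):
    """Apply each key/value pair through `conf.set` (for example, spark.conf)."""
    for key, value in settings.items():
        conf.set(key, value)
```

In a notebook, each argument could be fetched with `dbutils.secrets.get(scope="<scope-name>", key="<key-name>")` (placeholder names), and the result applied with `apply_settings(spark.conf, settings)`.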

In [0]:
%fs ls adl://kpadls.azuredatalakestore.net/testing/

In [0]:
%scala
spark.read.parquet("dbfs:/mnt/my-datasets/datasets/iot/events").write.mode("overwrite").parquet("adl://kpadls.azuredatalakestore.net/testing/tmp/kp/v1")

In [0]:
%fs ls adl://kpadls.azuredatalakestore.net/testing/tmp/kp/v1