# Goal

The goal of this notebook is to experiment with the Workspace API in Databricks to understand how to use it and, eventually, how to push files from GitHub up to the workspace and pull them back down.

The code uses the `requests` library to make the REST calls.

The API documentation is available [here](https://docs.azuredatabricks.net/api/latest/index.html).

Specifically, we will focus on the [Workspace API](https://docs.azuredatabricks.net/api/latest/workspace.html).


# Dependencies

This notebook uses two libraries:

- `python-dotenv` to load and manage access tokens (see authentication below)
- `requests` to send the REST calls

Both are installed into a conda environment by running the following command in the same directory as this notebook:

```
conda env create -f dbapi_conda.yml
```



# Step 1 - just reading the workspace

## Requirements

### Authentication

This section works through an example using a personal access token.

To use this approach, you need to generate a Databricks personal access token:

- Click on the `user` icon on the top right
- Click on the `User Settings` menu item
- Select the `Access Tokens` tab
- Click on `Generate Access Token`
- Fill in the requested fields

Copy that token into a file called `.env` with the format:

```
DB_ACCESS_TOK=<THISISMYLONGSTRINGFROMDATABRICKS>
```

We will load this file later with the `python-dotenv` package, so that the access token never appears in the notebook itself.

We'll start by just reading information from the workspace.



# Working directory

First, we make sure that the current working directory is the one that contains the `.env` file, so that `python-dotenv` can find the token. This should be the same directory that this notebook is in.


In [34]:
import os
print('\n**** Current working directory ****\n%s\n' %(os.getcwd()))


**** Current working directory ****
C:\Users\jeremr\Documents\GitHub\MachineLearningSamples-PredictiveMaintenance\Code
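Before loading the token, it can help to confirm that a `.env` file is actually reachable from the working directory. The following is a minimal sketch using only the standard library; the helper name `find_env_file` is hypothetical and is not part of `python-dotenv`.

```python
from pathlib import Path


def find_env_file(start=None, filename=".env"):
    """Return the path to `filename` in `start` or its nearest parent, or None.

    Hypothetical helper for illustration; not part of python-dotenv.
    """
    start_dir = Path(start).resolve() if start is not None else Path.cwd()
    # Check the starting directory first, then walk up through its parents.
    for directory in [start_dir, *start_dir.parents]:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


env_path = find_env_file()
print(env_path if env_path else "No .env file found from %s" % Path.cwd())
```

If this reports that no file was found, change to the notebook's directory before continuing.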



# Load the token

Assuming the working directory is the one you expected, load the `.env` file. This should add the variable `DB_ACCESS_TOK` to your environment.

In [35]:
from dotenv import load_dotenv
load_dotenv(verbose=True)


True

# Load the relevant libraries

In [36]:
import json
import base64
import requests

# Set up the variables describing your Databricks environment


In [37]:
DOMAIN = 'southcentralus.azuredatabricks.net'
TOKEN = str.encode(os.getenv('DB_ACCESS_TOK')) ## convert to bytes as well
BASE_URL = 'https://%s/api/2.0/' % (DOMAIN)

In [38]:
BASE_URL

'https://southcentralus.azuredatabricks.net/api/2.0/'
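Note that `os.getenv` returns `None` when the variable is missing, in which case `str.encode` fails with a `TypeError` that does not mention the token. A minimal sketch of a more explicit check follows, assuming the variable name `DB_ACCESS_TOK` used in the cell above; the helper names `get_token` and `endpoint` are hypothetical.

```python
import os


def get_token(env=None, name="DB_ACCESS_TOK"):
    """Return the access token as bytes, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(
            "Environment variable %s is not set; check your .env file" % name
        )
    return value.strip().encode("ascii")


def endpoint(domain, name):
    """Build the URL of an API 2.0 endpoint such as 'workspace/list'."""
    return "https://%s/api/2.0/%s" % (domain, name.lstrip("/"))


print(endpoint("southcentralus.azuredatabricks.net", "workspace/list"))
```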

# Encode the header

In [39]:
my_header = {"Authorization": b"Basic " + base64.standard_b64encode(b"token:" + TOKEN)}
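The cell above builds an HTTP Basic header whose user name is the literal string `token` and whose password is the personal access token. As a sketch, the same header can be built as a plain string, which makes it easier to inspect. The helper names and the sample token below are hypothetical. The Databricks documentation also shows the token being sent as `Authorization: Bearer <token>`, which avoids the base64 step.

```python
import base64


def basic_auth_header(token):
    """Build the Basic auth header used above: base64 of 'token:<token>'."""
    credentials = ("token:" + token).encode("ascii")
    encoded = base64.standard_b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + encoded}


def bearer_auth_header(token):
    """Build the Bearer form of the header, which sends the token as-is."""
    return {"Authorization": "Bearer " + token}


# 'dapi-example' is a made-up token used only to show the header shape.
print(basic_auth_header("dapi-example"))
print(bearer_auth_header("dapi-example"))
```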

# Submit the request

In [40]:
response = requests.get(
    BASE_URL + "workspace/list",
    headers = my_header,
    json={
        "path": "/Users/jeremr@microsoft.com/"
    }
)

In [41]:
print(response)

<Response [200]>
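Printing the response only shows the status code. A minimal sketch of a stricter check follows, assuming that failed calls return a JSON body with `error_code` and `message` fields; the helper name `check_response` is hypothetical.

```python
def check_response(response):
    """Return the decoded JSON body, or raise with the API's error details.

    Works with any object that has `status_code`, `text`, and `json()`,
    such as the `response` returned by `requests.get` above.
    """
    if response.status_code == 200:
        return response.json()
    try:
        body = response.json()
    except ValueError:
        # The body was not valid JSON; fall back to the raw text below.
        body = {}
    if not isinstance(body, dict):
        body = {}
    # Assumption: error bodies carry 'error_code' and 'message' fields.
    raise RuntimeError(
        "Request failed with status %s: %s %s"
        % (
            response.status_code,
            body.get("error_code", ""),
            body.get("message", response.text),
        )
    )
```

Calling `check_response(response)` after each request stops the notebook at the first failed call instead of carrying on silently.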


# If successful

`response` should have a status code of 200. You can then view its contents by converting it to JSON.

The preceding `get` call to `workspace/list` lists the libraries, files, and directories under the given path.

In [43]:
response.json()

{'objects': [{'object_type': 'LIBRARY',
   'path': '/Users/jeremr@microsoft.com/graphframes-0.6.0-spark2.3-s_2.11'},
  {'object_type': 'DIRECTORY', 'path': '/Users/jeremr@microsoft.com/Trash'},
  {'object_type': 'DIRECTORY', 'path': '/Users/jeremr@microsoft.com/Github'},
  {'object_type': 'DIRECTORY',
   'path': '/Users/jeremr@microsoft.com/Rstudio-helpers'},
  {'object_type': 'DIRECTORY',
   'path': '/Users/jeremr@microsoft.com/TrainingMaterials'}]}
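Since the result is a plain dictionary, it can be filtered with ordinary Python. A small sketch, using a shortened copy of the listing above; the helper name `paths_of_type` is hypothetical.

```python
# A shortened copy of the listing shown above.
listing = {
    "objects": [
        {"object_type": "LIBRARY",
         "path": "/Users/jeremr@microsoft.com/graphframes-0.6.0-spark2.3-s_2.11"},
        {"object_type": "DIRECTORY", "path": "/Users/jeremr@microsoft.com/Trash"},
        {"object_type": "DIRECTORY", "path": "/Users/jeremr@microsoft.com/Github"},
    ]
}


def paths_of_type(listing, object_type):
    """Return the paths of all listed objects with the given object_type."""
    return [
        obj["path"]
        for obj in listing.get("objects", [])
        if obj.get("object_type") == object_type
    ]


print(paths_of_type(listing, "DIRECTORY"))
```

Using `.get("objects", [])` keeps the helper from raising a `KeyError` if a response has no `objects` key.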

# Make a directory

To make a directory, we use a different HTTP method and a different endpoint: a `post` to `workspace/mkdirs`:

In [63]:
DB_HOME_DIR = "/Users/jeremr@microsoft.com"
DIR_NAME = "MachineLearningSamples-PredictiveMaintenance"

In [50]:
response = requests.post(
    BASE_URL + "workspace/mkdirs",
    headers = my_header,
    json={
        "path": '/'.join([DB_HOME_DIR, DIR_NAME])
    }
)

In [51]:
print(response)

<Response [200]>
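The remote path above is built with `'/'.join`, which produces a doubled slash if one of the parts already ends in `/`. As a sketch of an alternative, `posixpath.join` from the standard library always joins with forward slashes, whatever the local operating system; the helper name `workspace_path` is hypothetical.

```python
import posixpath


def workspace_path(*parts):
    """Join workspace path parts with forward slashes, regardless of the local OS.

    Note: as with os.path.join, a later part that starts with '/' replaces
    everything before it.
    """
    return posixpath.join(*parts)


print(workspace_path("/Users/jeremr@microsoft.com/",
                     "MachineLearningSamples-PredictiveMaintenance"))
```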


# Import a notebook

To import a file, we use a different endpoint. The request is still a `post`, but instead of `workspace/mkdirs` we use `workspace/import`.

# A simple example

This example comes from the API documentation. The notebook source has already been converted to base64, which makes it a good first test that you can import a very simple notebook.

In [80]:
r = requests.post(
    BASE_URL + "workspace/import",
    headers = my_header,
    json = {
      "path": '/'.join([DB_HOME_DIR, DIR_NAME, "new-notebook"]),
      "format": "SOURCE",
      "language": "SCALA",
      "content": "Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKcHJpbnQoImhlbGxvLCB3b3JsZCIpCgovLyBDT01NQU5EIC0tLS0tLS0tLS0KCg==",
      "overwrite": "false"
    }
)

In [109]:
print(r)
print(r.json())

<Response [200]>
{}
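To see what the `content` field in the request above holds, it can be decoded with the standard `base64` module. The sketch below decodes the string from the example and shows the reverse step for a source string of your own; the helper name `encode_source` is hypothetical.

```python
import base64

# The 'content' value from the request above.
content = "Ly8gRGF0YWJyaWNrcyBub3RlYm9vayBzb3VyY2UKcHJpbnQoImhlbGxvLCB3b3JsZCIpCgovLyBDT01NQU5EIC0tLS0tLS0tLS0KCg=="

# Decode it back into the notebook's Scala source.
source = base64.b64decode(content).decode("utf-8")
print(source)


def encode_source(text):
    """Base64-encode notebook source text for the 'content' field."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")
```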


# More complex example

In this next example, I read in multiple notebooks and send each one via the `files` argument to `requests.post()`.

This requires that the files listed in `filenames` are in the current directory.

Note that the API limits how large these files can be.

In [110]:
filenames = ['1_data_ingestion.ipynb',
 '2_feature_engineering.ipynb',
 '3_model_building.ipynb',
 '4_operationalization.ipynb']


In [108]:
responses = []
for fname in filenames:
    print(fname)
    remotepath = '/'.join([DB_HOME_DIR, DIR_NAME, fname])
    print(remotepath)
    with open(fname, 'rb') as f:
        data = f.read()
    ## post the request:
    r = requests.post(
        BASE_URL + "workspace/import",
        headers = my_header,
        data={
            "path": remotepath,
            "format": "JUPYTER",
            "language": "PYTHON",
            "overwrite": "true"
        },
        files={
            "content": data
        }
    )
    print(r)
    responses.append(r)

1_data_ingestion.ipynb
/Users/jeremr@microsoft.com/MachineLearningSamples-PredictiveMaintenance/1_data_ingestion.ipynb
<Response [200]>
2_feature_engineering.ipynb
/Users/jeremr@microsoft.com/MachineLearningSamples-PredictiveMaintenance/2_feature_engineering.ipynb
<Response [200]>
3_model_building.ipynb
/Users/jeremr@microsoft.com/MachineLearningSamples-PredictiveMaintenance/3_model_building.ipynb
<Response [200]>
4_operationalization.ipynb
/Users/jeremr@microsoft.com/MachineLearningSamples-PredictiveMaintenance/4_operationalization.ipynb
<Response [200]>
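An alternative to the multipart upload above is to base64-encode each file and send it in the `content` field of a JSON body, as in the simple example. The following is a sketch under that assumption; the helper name `build_import_payload` is hypothetical, and the result has not been checked against large notebooks.

```python
import base64


def build_import_payload(local_path, remote_path, fmt="JUPYTER",
                         language="PYTHON", overwrite=True):
    """Read a local file and build a JSON body for workspace/import."""
    with open(local_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {
        "path": remote_path,
        "format": fmt,
        "language": language,
        "content": encoded,
        # Sent as a JSON boolean here; the cells above pass it as a string.
        "overwrite": overwrite,
    }
```

The resulting dictionary would be passed as `json=payload` to `requests.post(BASE_URL + "workspace/import", headers = my_header, ...)`, in place of the `data` and `files` arguments.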


