# Create an init script for setting up RStudio and installing R packages

- Author: Jeremy Reynolds
- Date last modified: 2018-08-13
- license: MIT

The goal of this notebook is to write a script to the location where Databricks will run it as an init script for a particular cluster. The script installs `RStudio` and, optionally, additional R packages.

The init script is based on the documentation [here](https://docs.azuredatabricks.net/spark/latest/sparkr/rstudio.html), and additionally installs R packages (in its default state, the notebook updates `Rcpp` and installs `sparklyr`).

This notebook only needs to be run once to generate the init script for a given cluster, and then the init script that is generated will be run each time the cluster is started. If you decide you want additional `R` packages installed, then simply update the `rpkgnames` variable in the second cell, and run the notebook again to update the init script.
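
Because each entry in `rpkgnames` is pasted directly into a shell command, a typo or stray quote would only surface when the cluster starts. The sketch below checks the names before the script is generated. The helper `check_pkgnames` is hypothetical (it is not part of this notebook), and the pattern assumes the usual R naming rule: letters, digits, and dots, starting with a letter, at least two characters, not ending in a dot.

```python
import re

# Assumed rule for R package names: letters, digits and dots only,
# starts with a letter, at least two characters, does not end in a dot.
_R_PKG_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]\Z")


def check_pkgnames(pkgnames):
    """Hypothetical helper: return the names as a list, or raise on a bad one."""
    for name in pkgnames:
        if not _R_PKG_NAME.match(name):
            raise ValueError("not a valid R package name: %r" % (name,))
    return list(pkgnames)


# Example: validate the list before it is used to build the init script.
rpkgnames = check_pkgnames(['Rcpp', 'sparklyr', 'data.table'])
print(rpkgnames)
```

If used, a call like this could sit right after `rpkgnames` is defined in the second cell, so a bad name fails when the notebook runs rather than at cluster startup.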

## Requirements

See the [documentation](https://docs.azuredatabricks.net/spark/latest/sparkr/rstudio.html#requirements).

## Instructions on using this notebook

1. Import this notebook into your Databricks environment as a Python notebook.
2. Make sure the notebook is attached to an appropriate cluster (see Requirements).
3. Update the `clustername` variable in the second cell.
4. Execute the notebook.
5. Restart the cluster.
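
Step 3 matters because the cluster name becomes part of the path the init script is written to. As a minimal sketch of that relationship, the hypothetical helper `init_script_path` below (not part of this notebook) builds the same path as the `scriptpath` variable and rejects names that would change the directory layout.

```python
import posixpath


def init_script_path(clustername, filename="rstudio-install.sh"):
    """Hypothetical helper: build the per-cluster init script path."""
    # An empty name or one containing "/" would point at a different directory.
    if not clustername or "/" in clustername:
        raise ValueError("unexpected cluster name: %r" % (clustername,))
    return posixpath.join("/databricks/init", clustername, filename)


print(init_script_path("jr-rstudio-tst2"))
# prints /databricks/init/jr-rstudio-tst2/rstudio-install.sh
```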

## Instructions on accessing RStudio

See the instructions [here](https://docs.azuredatabricks.net/spark/latest/sparkr/rstudio.html#use-rstudio-server-open-source) for more details. This list is just a rehash of that page without images.

1. Click the `Clusters` button on the left side of the workspace.
2. Select the cluster you have installed RStudio on.
3. Click the Apps tab.
4. In the Apps tab, click the Set up RStudio button. This generates a one-time password for you. Click the show link to display it and copy the password.
5. Click the Open RStudio UI link to open the UI in a new tab. 
6. In the new browser window, enter your username and password to sign in.


## Notes

- If you restart a Databricks cluster while RStudio is running, you may have to log out of the workspace and back in before the one-time password is updated in the Apps tab of the cluster.
- The init script can add noticeably to the cluster's startup time, because it downloads and installs RStudio and each listed R package every time the cluster starts.
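
One possible way to trim some of that overhead is to install all packages in a single R session instead of launching R once per package. The sketch below is an untested assumption, not something this notebook does: `write_batched_install` is a hypothetical variant of the notebook's `writescript` function, and whether it measurably shortens startup has not been checked here.

```python
def write_batched_install(pkgnames, rcmd="/usr/bin/R",
                          repos="http://cran.us.r-project.org"):
    """Hypothetical variant: one install.packages() call for all packages."""
    # Build the R character vector contents, e.g. 'Rcpp', 'sparklyr'
    quoted = ", ".join("'%s'" % p for p in pkgnames)
    return (
        "\nsudo %s --vanilla -e \"install.packages(c(%s), repos='%s')\"\n"
        % (rcmd, quoted, repos)
    )


print(write_batched_install(['Rcpp', 'sparklyr']))
```

The returned string could be appended to the base script in the same way as the per-package commands.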

In [2]:
## Set up variables - this should be the only cell that needs to be modified
## RStudio settings
clustername = "jr-rstudio-tst2"  # name of your Databricks cluster; determines where the init script is placed

rsversion = "1.1.456"  # RStudio Server version to download
addRinstalls = True  # set to False to skip the extra R package installs
rpkgnames = ['Rcpp', 'sparklyr']  # NOTE: versions are not pinned; the latest CRAN release is installed


In [3]:
## store the path to the script you are going to write:
scriptpath = "/databricks/init/"+clustername+"/rstudio-install.sh"
## generate the base script to install RStudio - based on https://docs.azuredatabricks.net/spark/latest/sparkr/rstudio.html
script = """
sudo apt-get install -y gdebi-core alien
cd /tmp
sudo wget https://download2.rstudio.org/rstudio-server-"""+rsversion+"""-amd64.deb
sudo gdebi -n rstudio-server-"""+rsversion+"""-amd64.deb
sudo rstudio-server restart
"""
## print it out just to make sure of formatting:
print(script)


In [4]:
## One way to pin specific package versions: install from source tarballs.
## This would require managing dependencies (and their versions) manually.
## Left here, commented out, as a reminder for later.
#Rpkgs2install = ['Rcpp_0.12.18.tar.gz',
#               'sparklyr_0.8.4.tar.gz']
# def writescriptold(pkgurl, urlpath = "https://cran.r-project.org/src/contrib/", rcmd = "/usr/bin/R"):
#   import os
#   script = """
#   sudo wget """+os.path.join(urlpath,pkgurl)+"""
#   sudo """+rcmd+""" CMD INSTALL """+pkgurl+"""
#   """
#   return(script)

In [5]:
# Build the shell command that installs one package from CRAN.
# It calls install.packages() so that R handles dependency management.
def writescript(pkgname, rcmd = "/usr/bin/R", repos = "http://cran.us.r-project.org"):
  script = """
sudo """+rcmd+""" --vanilla -e "install.packages('"""+pkgname+"""', repos='"""+repos+"""')"
"""
  return script

## if you want to do R package installs
if addRinstalls:
  ## create the R calls for each package in rpkgnames
  Rinstallcmds = [writescript(p) for p in rpkgnames]
  ## join them appropriately to 1 string
  print(''.join(Rinstallcmds))
  ## add to the base script
  script = script + ''.join(Rinstallcmds)


In [6]:
## make sure the script looks fine.
print(script)

In [7]:
## store the script variable as a file in the appropriate location
## (the third argument, True, overwrites any existing file at that path)
dbutils.fs.put(scriptpath, script, True)
## print out the file to make sure it is written appropriately and matches the syntax above.
print(dbutils.fs.head(scriptpath))
print("""
******
Done!
******
""")