
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# 3.3 Accessing Data

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this notebook you:

* Read data from a BLOB store
* Read data in serial from JDBC
* Read data in parallel from JDBC

In [0]:
%run ../Includes/Classroom-Setup

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) DBFS Mounts and S3

Amazon Simple Storage Service (S3) is the backbone of Databricks workflows.  S3 offers data storage that easily scales to the demands of most data applications and, by colocating data with Spark clusters, Databricks quickly reads from and writes to S3 in a distributed manner.

The Databricks File System, or DBFS, is a layer over S3 that allows you to [mount S3 buckets](https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-aws-s3), making them available to other users in your workspace and persisting the data after a cluster is shut down.

In Azure Databricks, DBFS is backed by Azure Blob Storage. More documentation can be found [here](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage).

In the first lesson, you uploaded data using the Databricks user interface.  This can be done by clicking Data on the left-hand side of the screen.  Here we'll be mounting data.

Define your AWS credentials.  The cell below defines read-only keys, the name of an AWS bucket, and the mount name we will refer to in DBFS.

To obtain AWS keys, take a look at <a href="https://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html" target="_blank">the AWS documentation</a>.

In [0]:
%python
ACCESS_KEY = "AKIAJBRYNXGHORDHZB4A"
# URL-encode any "/" characters in the secret key so it can be embedded in the s3a:// URI
SECRET_KEY = "a0BzE1bSegfydr3%2FGE3LSPM6uIV5A4hOUfpH8aFF".replace("/", "%2F")
AWS_BUCKET_NAME = "davis-dsv1071/data/"
MOUNT_NAME = "/mnt/davis-tmp"
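
The comment in the cell above refers to URL-encoding the secret key. As a minimal sketch (the helper name `build_s3a_uri` and the sample credentials are hypothetical, not part of the course material), the standard library's `urllib.parse.quote` can percent-encode every reserved character, not only `/`, before the key is embedded in an `s3a://` URI:

```python
from urllib.parse import quote


def build_s3a_uri(access_key, secret_key, bucket_path):
    """Build an s3a:// URI with the secret key percent-encoded.

    Hypothetical helper for illustration. Pass the raw (unencoded) secret key:
    an already-encoded key would be encoded a second time.
    """
    # safe="" makes quote() encode "/" as well, which it leaves alone by default
    encoded_secret = quote(secret_key, safe="")
    return "s3a://{}:{}@{}".format(access_key, encoded_secret, bucket_path)


# Made-up credentials, for illustration only
example_uri = build_s3a_uri("EXAMPLEACCESSKEY", "abc/def+ghi", "my-bucket/data/")
print(example_uri)
```

Note that the key in the cell above already contains `%2F`, so it should not be passed through a helper like this again.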


Now mount the bucket [using the template provided in the docs.](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html#mounting-an-s3-bucket)

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The code below includes error handling for the case where the bucket is already mounted.

In [0]:
%python
try:
  dbutils.fs.mount("s3a://{}:{}@{}".format(ACCESS_KEY, SECRET_KEY, AWS_BUCKET_NAME), MOUNT_NAME)
except Exception:
  # The most likely failure here is that MOUNT_NAME is already in use
  print("""Could not mount {}; it may already be mounted. Run `dbutils.fs.unmount("{}")` to unmount it first.""".format(MOUNT_NAME, MOUNT_NAME))
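
An alternative to catching the exception is to check the existing mount points before mounting. The sketch below assumes that each record returned by `dbutils.fs.mounts()` exposes a `mountPoint` attribute; `is_mounted` and the `MountInfo` stand-in are hypothetical names used so that the example runs outside Databricks.

```python
from collections import namedtuple

# Stand-in for the records returned by dbutils.fs.mounts(), so the sketch runs anywhere
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])


def is_mounted(mounts, mount_name):
    """Return True if mount_name is among the given mount records."""
    # Ignore a trailing slash so "/mnt/x" and "/mnt/x/" compare equal
    target = mount_name.rstrip("/")
    return any(m.mountPoint.rstrip("/") == target for m in mounts)


sample_mounts = [
    MountInfo("/mnt/training", "s3a://databricks-corp-training/common"),
    MountInfo("/mnt/davis", "s3a://davis-dsv1071/data"),
]

print(is_mounted(sample_mounts, "/mnt/davis"))      # True
print(is_mounted(sample_mounts, "/mnt/davis-tmp"))  # False
```

In a Databricks notebook, the same check would presumably be called as `is_mounted(dbutils.fs.mounts(), MOUNT_NAME)` before `dbutils.fs.mount(...)`.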

Next, explore the mount using `%fs ls` and the name of the mount.

In [0]:
%fs ls /mnt/davis-tmp

path,name,size
dbfs:/mnt/davis-tmp/dataframes/,dataframes/,0
dbfs:/mnt/davis-tmp/fire-calls/,fire-calls/,0
dbfs:/mnt/davis-tmp/fire-incidents/,fire-incidents/,0


In practice, always secure your AWS credentials.  Do this either by maintaining a single notebook with restricted permissions that holds the AWS keys, or by deleting the cells or notebooks that expose the keys. After the cell that mounts a bucket has run, you can access the data at this mount point from any notebook or cluster in Databricks, and share the mount with colleagues.

In [0]:
%fs mounts

mountPoint,source,encryptionType
/mnt/training,s3a://databricks-corp-training/common,
/mnt/davis,s3a://davis-dsv1071/data,
/databricks-datasets,databricks-datasets,sse-s3
/databricks/mlflow-tracking,databricks/mlflow-tracking,sse-s3
/databricks-results,databricks-results,sse-s3
/mnt/davis-tmp,s3a://davis-dsv1071/data/,
/databricks/mlflow-registry,databricks/mlflow-registry,sse-s3
/,DatabricksRoot,sse-s3


You now have access to an unlimited store of data.  This is a read-only S3 bucket.  If you create and mount your own, you can write to it as well.

Now unmount the data.

In [0]:
%python
dbutils.fs.unmount("/mnt/davis-tmp")

/mnt/davis-tmp has been unmounted.
Out[4]: True

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Serial JDBC Reads

Java Database Connectivity (JDBC) is an application programming interface (API) that defines database connections in Java environments.  Spark is written in Scala, which runs on the Java Virtual Machine (JVM).  This makes JDBC the preferred method for connecting to data whenever possible. Hadoop and Hive run on the JVM, and databases such as MySQL and PostgreSQL provide JDBC drivers, so all of them interface easily with Spark clusters.

Databases are advanced technologies that benefit from decades of research and development. To leverage the inherent efficiencies of database engines, Spark uses an optimization called predicate push down.  **Predicate push down uses the database itself to handle certain parts of a query (the predicates).**  In mathematics and functional programming, a predicate is anything that returns a Boolean.  In SQL terms, this often refers to the `WHERE` clause.  Since the database is filtering data before it arrives on the Spark cluster, there's less data transfer across the network and fewer records for Spark to process.  Spark's Catalyst Optimizer includes predicate push down communicated through the JDBC API, making JDBC an ideal data source for Spark workloads.
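
The effect of predicate push down can be illustrated without a cluster. The sketch below is an analogy, not Spark code: an in-memory SQLite database stands in for the remote JDBC source, and the table and rows are made up. It compares fetching every row and filtering on the client with letting the database evaluate the `WHERE` clause.

```python
import sqlite3

# In-memory SQLite database standing in for the remote JDBC source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (userID INTEGER, location TEXT)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [(1, "Philippines"), (2, "Dublin"), (3, "Philippines"), (4, "India")],
)

# Without push down: every row is transferred, then the client filters
all_rows = conn.execute("SELECT userID, location FROM account").fetchall()
filtered_on_client = [row for row in all_rows if row[1] == "Philippines"]

# With push down: the database evaluates the predicate and returns only matches
pushed_down = conn.execute(
    "SELECT userID, location FROM account WHERE location = ?", ("Philippines",)
).fetchall()

print(len(all_rows), "rows transferred without push down")
print(len(pushed_down), "rows transferred with push down")
conn.close()
```

Both approaches yield the same matching rows; the difference is how many rows leave the database.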

Run the cell below to confirm you are using the right driver.

In [0]:
%scala
Class.forName("org.postgresql.Driver")

First, create a table backed by the JDBC connection.  The `url` option holds the JDBC connection string.

In [0]:
%sql
DROP TABLE IF EXISTS twitterJDBC;

CREATE TABLE IF NOT EXISTS twitterJDBC
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "org.postgresql.Driver",
  url "jdbc:postgresql://server1.databricks.training:5432/training",
  user "readonly",
  password "readonly",
  dbtable "training.Account"
)

We now have a table of Twitter accounts.

In [0]:
%sql
SELECT * FROM twitterJDBC WHERE location != "NULL" LIMIT 10

userID,screenName,location,friendsCount,followersCount,description,insertID,ETLID
399221347,imSyue,Kuala Lumpur,181,499,,317827597565,3153
703875321561534464,Pinkynoora_17,"Lamongan, Indonesia",100,13,,317827597566,3153
896816295081115648,BtsarmyAuca,Bolivia,348,156,Si estamos juntos incluso el decierto se convierte en Mar @BTS_twt 💟 ARMY,317827597567,3153
18025631,SuperFreshSteph,"Boston, MA",2971,2161,"Product of Conflicting Realities— Passion for Problem Solving, Data, Rap Music & Merlot— MBA— Cat Mom— “all I want is everything”",317827597568,3153
828479655883661312,3azooz121222,"جدة, المملكة العربية السعودية",10594,11782,بسيط،اجتماعي،منعزل،أذكى،متغابي،محبوب مجنون،متهور،متأني،متسرع هادي،فوضوي،احب،طيوري،مربي،ومنتج اعشق الطبخ https://z3ozo.sarahah.com ✈️📽🍿 تغريداتي بالمفضله,317827597569,3153
128431366,caapatta,Paraná,872,2268,@SantosFC,317827597572,3153
830194412848287745,HeySraGrier,Havana oh na na,3174,1573,"Fan (n.); A person who will love, protect and support your idols at all costs",317827597573,3153
569454066,LyssBailey1,,244,327,Bloomsburg University ‘20,317827597575,3153
77485311,sexy_aunty,India,2902,17955,Visit site for spicy and sexy stuff like top videos. This is a blog for free and un-limited.,317827597577,3153
18202608,quietprofanity,"Monmouth County, NJ",155,148,I like books and comics and movies. #RESIST My stomach hurts.,317827597578,3153


Add a subquery to `dbtable`.  This pushes the predicate to the database, which applies the filter before the data is transferred to your Spark cluster.

In [0]:
%sql
DROP TABLE IF EXISTS twitterPhilippinesJDBC;

CREATE TABLE twitterPhilippinesJDBC
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "org.postgresql.Driver",
  url "jdbc:postgresql://server1.databricks.training:5432/training",
  user "readonly",
  password "readonly",
  dbtable "(SELECT * FROM training.Account WHERE location = 'Philippines') as subq"
)

In [0]:
%sql
SELECT * FROM twitterPhilippinesJDBC LIMIT 10

userID,screenName,location,friendsCount,followersCount,description,insertID,ETLID
741082009137618945,BalinoZafe,Philippines,87,40,,335007466410,3153
812388884,jeonshookted,Philippines,642,468,"Love yourself, love myself, peace✌| I don't tolerate fanwars on my timeline u r free to go [lowkey:reader] ot7🔥",360777253361,3153
151373723,jennispocket,Philippines,540,485,"4/5 MARCH 21, 2015 #OTRAMNL",360777253599,3153
835492079476203520,jeco_0114,Philippines,167,140,,343597384676,3153
30807063,ch3rrryanni3,Philippines,207,1002,ELF. Inspirit. Once. 成种. 성종. WOOJONG. 우종 남쫑. #우횬이형♡ #우현이형,343597384856,3153
465440937,bobbyuntaehyung,Philippines,2400,2837,"I used to be normal but then 1D, KPOP & KDRAMA happened. ONE DIRECTION, 빅뱅, 방탄소년단, EXO is KINGS. BISEXUAL, HARD STAN AF, SO NSFW. 🔞 Block when unfollowed.",369367188516,3153
221730134,daryloopy,Philippines,600,1054,"in a world of multiple and beautiful colors, I can only see black",369367188858,3153
3008161818,secretlovesongx,Philippines,199,157,"Calm mind, happy heart 😌🍃 x selenator x mixer",369367189285,3153
864246595,DhevielLhy,Philippines,28,11,,343597386111,3153
335212089,haerim27,Philippines,385,300,JYP Nation is ❤ Kim So Hyun is ❤,343597386521,3153
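
The same subquery pattern can be expressed from Python. The sketch below only builds the options dictionary; `jdbc_pushdown_options` is a hypothetical helper, and the commented-out reader call assumes a SparkSession named `spark` is available on the cluster.

```python
def jdbc_pushdown_options(url, user, password, table, predicate, alias="subq"):
    """Build JDBC reader options whose dbtable is a filtering subquery.

    Hypothetical helper: the predicate is pasted into the SQL text as-is,
    so it must come from trusted input.
    """
    subquery = "(SELECT * FROM {} WHERE {}) AS {}".format(table, predicate, alias)
    return {
        "driver": "org.postgresql.Driver",
        "url": url,
        "user": user,
        "password": password,
        "dbtable": subquery,
    }


options = jdbc_pushdown_options(
    url="jdbc:postgresql://server1.databricks.training:5432/training",
    user="readonly",
    password="readonly",
    table="training.Account",
    predicate="location = 'Philippines'",
)
print(options["dbtable"])

# On a cluster with a SparkSession named `spark` (an assumption here), the
# options could be passed to the JDBC reader:
#   df = spark.read.format("jdbc").options(**options).load()
```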


## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Parallel JDBC Reads

In [0]:
%sql
SELECT min(userID) as minID, max(userID) as maxID from twitterJDBC

minID,maxID
509,959519629566672896


In [0]:
%sql
DROP TABLE IF EXISTS twitterParallelJDBC;

CREATE TABLE IF NOT EXISTS twitterParallelJDBC
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "org.postgresql.Driver",
  url "jdbc:postgresql://server1.databricks.training:5432/training",
  user "readonly",
  password "readonly",
  dbtable "training.Account",
  partitionColumn '"userID"',
  lowerBound 2591,
  upperBound 951253910555168768,
  numPartitions 8
)
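
With `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` set, Spark issues one query per partition, each covering a range of the partition column. The bounds only determine the stride of those ranges; they do not filter rows, so values below `lowerBound` fall into the first partition and values above `upperBound` into the last. The sketch below approximates the per-partition `WHERE` clauses. `partition_predicates` is a hypothetical helper, and Spark's own stride arithmetic may round differently.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the per-partition WHERE clauses of a parallel JDBC read.

    Simplified illustration: Spark's own stride arithmetic may round differently.
    """
    if num_partitions < 2:
        # A single partition reads the whole table with no WHERE clause
        return []
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower_bound + i * stride
        end = start + stride
        if i == 0:
            # First partition is open below and also collects NULLs
            predicates.append("{0} < {1} OR {0} IS NULL".format(column, end))
        elif i == num_partitions - 1:
            # Last partition is open above
            predicates.append("{0} >= {1}".format(column, start))
        else:
            predicates.append("{0} >= {1} AND {0} < {2}".format(column, start, end))
    return predicates


# Small made-up bounds so the ranges are easy to read
for clause in partition_predicates('"userID"', 0, 100, 4):
    print(clause)
```

Because the ranges are equal in width, a skewed partition column can leave some partitions with far more rows than others.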

In [0]:
%sql
SELECT * FROM twitterParallelJDBC WHERE location != "NULL" LIMIT 10

userID,screenName,location,friendsCount,followersCount,description,insertID,ETLID
399221347,imSyue,Kuala Lumpur,181,499,,317827597565,3153
18025631,SuperFreshSteph,"Boston, MA",2971,2161,"Product of Conflicting Realities— Passion for Problem Solving, Data, Rap Music & Merlot— MBA— Cat Mom— “all I want is everything”",317827597568,3153
128431366,caapatta,Paraná,872,2268,@SantosFC,317827597572,3153
569454066,LyssBailey1,,244,327,Bloomsburg University ‘20,317827597575,3153
77485311,sexy_aunty,India,2902,17955,Visit site for spicy and sexy stuff like top videos. This is a blog for free and un-limited.,317827597577,3153
18202608,quietprofanity,"Monmouth County, NJ",155,148,I like books and comics and movies. #RESIST My stomach hurts.,317827597578,3153
1535955007,pcyrads,#TOKEB ft 6ERONDON9,3142,3982,"( RP/NSFW ) South Korean rapper, singer, actor and @rlkimve's fulltime lover♡. Better known by the mononym Chanyeol. (Chanyoung & Taeyeol)'s handsome daddy.",317827597581,3153
1508545082,justinshunty,est. 2010,319,20518,are u expecting me to follow you back?,317827597582,3153
110971001,documentavi,Dublin,234,58,RTs/follows are not to be read unfailingly as endorsements. I sometimes post material with which I do not agree.,317827597585,3153
483881718,popkornism,Adventure Time,714,1896,Alex Turner's my daddy. Lana Del Rey is my mother. iKON is my bestest friend.,317827597586,3153


### Compare the performance of the serial and parallel reads from the database.<br>
**timeit**: measures the execution time of a code snippet.<br>

In [0]:
%python
%timeit sql("SELECT * from twitterJDBC").describe()

40.2 s ± 1.53 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [0]:
%python
%timeit sql("SELECT * from twitterParallelJDBC").describe()

38.8 s ± 2.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
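
Outside a notebook, where the `%timeit` magic is unavailable, a similar measurement can be sketched with the standard library's `timeit` module. `time_call` is a hypothetical helper, and the workload below is a placeholder for the Spark query.

```python
import statistics
import timeit


def time_call(func, repeat=7, number=1):
    """Return (mean, standard deviation) in seconds per call."""
    # Each entry of totals is the time taken by `number` calls of func
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [t / number for t in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


# Placeholder workload; in the notebook this would be the Spark query
mean_s, std_s = time_call(lambda: sum(range(100000)))
print("{:.6f} s ± {:.6f} s per loop".format(mean_s, std_s))
```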


For additional options [see the Spark docs.](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html)

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>