
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Reading Data - JDBC Connections

**Technical Accomplishments:**
- Read data from a relational database

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [0]:
%run "../Includes/Classroom-Setup"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading from JDBC

Working with a JDBC data source differs significantly from working with any of the other data sources:
* Configuration settings can be a lot more complex.
* We are often required to "register" the JDBC driver for the target database.
* We have to juggle the number of DB connections.
* We have to instruct Spark how to partition the data.

**NOTE:** For security reasons, the database is read-only.
* This notebook does not demonstrate writing to a JDBC database.
* For examples of writing via JDBC, see:
  * <a href="https://docs.azuredatabricks.net/spark/latest/data-sources/sql-databases.html" target="_blank">Connecting to SQL Databases using JDBC</a>
  * <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases" target="_blank">JDBC To Other Databases</a>

In [0]:
%scala

// Ensure that the driver class is loaded. 
// Seems to be necessary sometimes.
Class.forName("org.postgresql.Driver") 
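
As an alternative to loading the class by hand, Spark's JDBC data source also accepts a `driver` option that names the driver class. The sketch below only assembles the option dictionary, so it runs without a cluster. The helper name `jdbc_options` is hypothetical, and the commented-out read assumes a live `spark` session and a reachable database.

```python
def jdbc_options(url, table, user, password, driver=None):
    """Build the option dictionary for spark.read.format("jdbc")."""
    options = {
        "url": url,          # the JDBC URL
        "dbtable": table,    # the table (or parenthesized subquery) to read
        "user": user,
        "password": password,
    }
    if driver is not None:
        options["driver"] = driver  # class name of the JDBC driver to use
    return options


options = jdbc_options(
    url="jdbc:postgresql://54.213.33.240/training",
    table="training.people_1m",
    user="readonly",
    password="readonly",
    driver="org.postgresql.Driver",
)
print(options["driver"])

# With a live Spark session (not executed here):
# df = spark.read.format("jdbc").options(**options).load()
```

Keeping the options in one dictionary makes it easy to reuse the same connection settings across several reads.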

In [0]:
tableName = "training.people_1m"
jdbcURL = "jdbc:postgresql://54.213.33.240/training"

# Username and Password w/read-only rights
connProperties = {
  "user" : "readonly",
  "password" : "readonly"
}

# And for some consistency in our test to come
spark.conf.set("spark.sql.shuffle.partitions", "8")

In [0]:
exampleOneDF = spark.read.jdbc(
  url=jdbcURL,                # the JDBC URL
  table=tableName,            # the name of the table
  properties=connProperties)  # the connection properties

exampleOneDF.printSchema()

**Question:** Compared to CSV and even Parquet, what is missing here?

**Question:** Based on the answer to the previous question, what are the ramifications of the missing...?

**Question:** Before you run the next cell, what's your best guess as to the number of partitions?

In [0]:
print("Partitions: " + str(exampleOneDF.rdd.getNumPartitions()) )

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) That's Not Parallelized

Let's try this again, and this time we are going to increase the number of connections to the database.

***Note:*** *If any one of these properties is specified, they must all be specified:*
* `partitionColumn` - the name of a column of an integral type that will be used for partitioning (passed as `column` in the Python `spark.read.jdbc` call).
* `lowerBound` - the minimum value of `partitionColumn` used to decide the partition stride.
* `upperBound` - the maximum value of `partitionColumn` used to decide the partition stride.
* `numPartitions` - the number of partitions/connections.

To quote the <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases" target="_blank">Spark SQL, DataFrames and Datasets Guide</a>:
> These options must all be specified if any of them is specified. They describe how to partition the table when reading in parallel from multiple workers. `partitionColumn` must be a numeric column from the table in question. Notice that `lowerBound` and `upperBound` are just used to decide the partition stride, not for filtering the rows in a table. So all rows in the table will be partitioned and returned. This option applies only to reading.
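
The quoted behavior can be illustrated with a simplified model of how Spark turns the bounds into one `WHERE` clause per partition. This is a sketch only: the function name `partition_predicates` is hypothetical, the integer arithmetic follows older Spark releases, and newer versions compute the stride somewhat differently.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the WHERE clauses of a partitioned JDBC read (simplified model)."""
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        # The last partition has no upper limit
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            predicates.append(f"{upper} or {column} is null")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


predicates = partition_predicates("id", 1, 200000, 8)
print(predicates[0])   # id < 25001 or id is null
print(predicates[1])   # id >= 25001 AND id < 50001
print(predicates[-1])  # id >= 175001
```

The first and last partitions are open-ended, which is why rows outside `lowerBound` and `upperBound` are still returned rather than filtered out.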

In [0]:
jdbcURL = "jdbc:postgresql://54.213.33.240/training"

exampleTwoDF = spark.read.jdbc(
  url=jdbcURL,                  # the JDBC URL
  table=tableName,              # the name of the table
  column="id",                  # the name of a column of an integral type that will be used for partitioning
  lowerBound=1,                 # the minimum value of the partition column used to decide the partition stride
  upperBound=200000,            # the maximum value of the partition column used to decide the partition stride
  numPartitions=8,              # the number of partitions/connections
  properties=connProperties)    # the connection properties

Let's start by checking how many partitions we have (it should be 8).

In [0]:
print("Partitions: " + str(exampleTwoDF.rdd.getNumPartitions()) )

But how many records were loaded per partition?

Using the `printRecordsPerPartition` utility method, we can print the per-partition count.

In [0]:
printRecordsPerPartition(exampleTwoDF)

That might be a problem... notice how many records are in the last partition?

**Question:** What are the performance ramifications of leaving our partitions like this?

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) That's Not [Well] Distributed

This is one of the little gotchas when working with JDBC: to properly specify the stride, we need to know the minimum and maximum values of the IDs.
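
To see why the bounds matter, the following sketch estimates the rows per partition for two choices of bounds. It rests on loud assumptions: the IDs are taken to be contiguous from 1 to 1,000,000 (the table name suggests one million rows, but the notebook does not confirm it), the stride uses simplified integer arithmetic, and `expected_counts` is a hypothetical helper, not part of Spark.

```python
def expected_counts(min_id, max_id, lower_bound, upper_bound, num_partitions):
    """Rows per partition, assuming every ID from min_id to max_id exists once."""
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    # Interior boundaries between neighbouring partitions
    cuts = [lower_bound + stride * i for i in range(1, num_partitions)]
    # The first and last partitions are open-ended, so the outer edges
    # are the real extremes of the data, not the bounds passed to Spark
    edges = [min_id] + cuts + [max_id + 1]
    # Clamp each edge into the range of IDs that actually exist
    edges = [min(max(e, min_id), max_id + 1) for e in edges]
    return [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]


# Assumed data: IDs 1..1,000,000 with no gaps
skewed = expected_counts(1, 1_000_000, 1, 200_000, 8)
balanced = expected_counts(1, 1_000_000, 1, 1_000_000, 8)

print(skewed)    # seven partitions of 25000 rows, then one of 825000
print(balanced)  # eight partitions of 125000 rows each
```

Under these assumptions, an `upperBound` well below the true maximum pushes most of the table into the last partition, so one task does most of the work.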

In [0]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in min() and max()

# Compute both bounds in a single query
bounds = (exampleTwoDF
  .select(F.min("id").alias("minimumID"),  # Compute the minimum ID
          F.max("id").alias("maximumID"))  # Compute the maximum ID
  .first()
)
minimumID = bounds["minimumID"]  # Extract as an integer
maximumID = bounds["maximumID"]  # Extract as an integer
print("Minimum ID: " + str(minimumID))
print("Maximum ID: " + str(maximumID))
print("-"*80)
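
Another way to obtain the bounds is to let the database compute them by passing a parenthesized subquery as the `table` argument. The sketch below only builds the subquery string. The helper name `bounds_query` and the aliases `min_id`, `max_id`, and `bounds` are hypothetical, and whether this is faster depends on the Spark version and the database.

```python
def bounds_query(table, column):
    """Build a subquery that returns the minimum and maximum of a column."""
    return (
        f"(SELECT min({column}) AS min_id, max({column}) AS max_id "
        f"FROM {table}) AS bounds"
    )


query = bounds_query("training.people_1m", "id")
print(query)

# With a live Spark session (not executed here):
# row = spark.read.jdbc(url=jdbcURL, table=query, properties=connProperties).first()
# minimumID, maximumID = row["min_id"], row["max_id"]
```

Only a single row travels over the connection in this variant, instead of the full `id` column.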

Now, let's try this one more time... this time with the proper stride:

In [0]:
exampleThree = spark.read.jdbc(
  url="jdbc:postgresql://54.213.33.240/training", # the JDBC URL
  table=tableName,                                # the name of the table
  column="id",                                    # the name of a column of an integral type that will be used for partitioning
  lowerBound=minimumID,                           # the minimum value of the partition column used to decide the partition stride
  upperBound=maximumID,                           # the maximum value of the partition column used to decide the partition stride
  numPartitions=8,                                # the number of partitions/connections
  properties=connProperties)                      # the connection properties

printRecordsPerPartition(exampleThree)
print("-"*80)

And of course we can view that data here:

In [0]:
display(exampleThree)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "../Includes/Classroom-Cleanup"

## Next Steps

* [Reading Data #1 - CSV]($./Reading Data 1 - CSV)
* [Reading Data #2 - Parquet]($./Reading Data 2 - Parquet)
* [Reading Data #3 - Tables]($./Reading Data 3 - Tables)
* [Reading Data #4 - JSON]($./Reading Data 4 - JSON)
* [Reading Data #5 - Text]($./Reading Data 5 - Text)
* Reading Data #6 - JDBC
* [Reading Data #7 - Summary]($./Reading Data 7 - Summary)

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>