# Read and Write from Spark to SQL using the Apache Spark Connector for SQL Server and Azure SQL
In a typical big data scenario, a key usage pattern is high-volume, high-velocity, high-variety data processing in Spark, followed by batch or streaming writes to SQL Server so that line-of-business (LOB) applications can access the results. These usage patterns benefit greatly from a connector that uses key SQL optimizations.

The Apache Spark Connector for SQL Server and Azure SQL provides an efficient write path to the SQL Server master instance and to the SQL Server data pool in Big Data Clusters.

Usage
----
- Familiar Spark DataSource V1 interface
- Referenced by fully qualified name "com.microsoft.sqlserver.jdbc.spark"
- Use from supported Spark language bindings (Python, Scala, Java, R)
- Optionally pass Bulk Copy parameters
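
As a rough illustration of the last bullet, the optional bulk copy settings can be collected in a plain dictionary and handed to the writer in one call. This is a minimal sketch, not the connector's own API: `connector_options` is a hypothetical helper, and the `tableLock` and `batchsize` option names are assumed from the connector's documentation, so check them against the version you use.

```python
CONNECTOR_FORMAT = "com.microsoft.sqlserver.jdbc.spark"


def connector_options(url, dbtable, user, password, **bulk_copy):
    """Return the writer options as a dict of strings (hypothetical helper).

    Extra keyword arguments are treated as optional bulk copy settings.
    """
    options = {
        "url": url,
        "dbtable": dbtable,
        "user": user,
        "password": password,
    }
    for name, value in bulk_copy.items():
        # Booleans are lower-cased so True becomes "true", matching the
        # string style used for options elsewhere in this sample.
        options[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return options


opts = connector_options(
    "jdbc:sqlserver://master-0.master-svc;databaseName=connector_test_db;",
    "AdultCensus_test",
    "connector_user",
    "<password>",
    tableLock=True,     # assumed option name
    batchsize=100000,   # assumed option name
)
print(opts["tableLock"], opts["batchsize"])
```

With a dataframe `df` in scope, the dictionary could then be applied as `df.write.format(CONNECTOR_FORMAT).mode("append").options(**opts).save()`.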

**Note:** The image below may not be visible due to a Markdown rendering bug. Change the path to a full path to view the image.

<img src="../data-virtualization/MSSQL_Spark_Connector2.jpg" style="float: center;" alt="drawing" width="900">

More details
-----------

The Apache Spark Connector for SQL Server and Azure SQL uses the [SQL Server Bulk Copy APIs](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-bulk-copy-with-the-jdbc-driver?view=sql-server-2017#sqlserverbulkcopyoptions) to implement an efficient write to SQL Server. The connector is based on the Spark data source APIs and provides a familiar JDBC interface for access.

The Sample
---------

The following sample uses the Apache Spark Connector for SQL Server and Azure SQL to read from and write to the SQL Server master instance and the SQL Server data pool in Big Data Clusters. The sample is divided into two parts:
- Part 1 shows read/write to the SQL Server master instance.
- Part 2 shows read/write to data pools in a Big Data Cluster.

In the sample we'll:
- Read a file from HDFS and do some basic processing.
- In Part 1, write the dataframe to a SQL Server table and then read the table back into a dataframe.
- In Part 2, write the dataframe to a SQL Server data pool external table and then read it back into a Spark dataframe.




    

## Prerequisites
- Download [AdultCensusIncome.csv](https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv) to your local machine. Upload this file to an HDFS folder named *spark_data*.
- The sample uses a SQL database *connector_test_db*, a user *connector_user* with password *password123!#*, and a data source *connector_ds*. The database, user/password, and data source need to be created before running the full sample. Refer to **mssql_spark_connector_user_creation.ipynb** for the steps to create this user.
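
The sample hard-codes these values for brevity. As a hedged alternative, the sketch below reads them from environment variables so the password stays out of the notebook. The variable names (`CONNECTOR_DB`, `CONNECTOR_USER`, `CONNECTOR_PASSWORD`, `CONNECTOR_DS`) and the helper `load_connector_settings` are hypothetical and not part of the connector.

```python
import os


def load_connector_settings(env=None):
    """Collect sample settings from a mapping of environment variables.

    All variable names here are illustrative assumptions. The password has
    no default, so a missing value fails fast instead of silently falling
    back to a well-known credential.
    """
    env = os.environ if env is None else env
    if "CONNECTOR_PASSWORD" not in env:
        raise KeyError("CONNECTOR_PASSWORD is not set")
    return {
        "dbname": env.get("CONNECTOR_DB", "connector_test_db"),
        "user": env.get("CONNECTOR_USER", "connector_user"),
        "password": env["CONNECTOR_PASSWORD"],
        "datasource": env.get("CONNECTOR_DS", "connector_ds"),
    }


# Example with an explicit mapping instead of the real environment
settings = load_connector_settings({"CONNECTOR_PASSWORD": "example-only"})
print(settings["dbname"], settings["user"], settings["datasource"])
```

The defaults mirror the database, user, and data source names listed above, so only the password has to be supplied.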

# Read CSV into a data frame
In this step we read the CSV into a data frame and do some basic cleanup steps. 
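
The cleanup in this step is a rename of the column headers: hyphens become underscores, presumably so the names are easier to use as SQL column names. A minimal Spark-free sketch of that rename, using a hypothetical helper `sanitize_columns` and a few of the header names from the CSV:

```python
def sanitize_columns(columns):
    """Replace "-" with "_" in each column name, preserving order."""
    return [name.replace("-", "_") for name in columns]


header = ["age", "education-num", "marital-status", "hours-per-week"]
print(sanitize_columns(header))
```

On a Spark dataframe the same list would be applied with `df.toDF(*sanitize_columns(df.columns))`.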




In [3]:
#spark = SparkSession.builder.getOrCreate()
sc.setLogLevel("INFO")

#Read a file and then write it to the SQL table
datafile = "/spark_data/AdultCensusIncome.csv"
df = spark.read.format('csv').options(header='true', inferSchema='true', ignoreLeadingWhiteSpace='true', ignoreTrailingWhiteSpace='true').load(datafile)
df.show(5)


#Process this data. Very simple data cleanup steps. Replacing "-" with "_" in column names
columns_new = [col.replace("-", "_") for col in df.columns]
df = df.toDF(*columns_new)
df.show(5)



+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|education|education-num|    marital-status|       occupation| relationship| race|   sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States| <=50K|
| 38|         Private|215646|  HS-grad|            9|          Divorced|Handlers-cleaners|Not-i

# (PART 1) Write to and Read from a SQL Table
- Write the dataframe to a SQL table on the SQL Server master instance
- Read the SQL table into a Spark dataframe
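
Both steps address the master instance through a JDBC URL made of the server address and a `databaseName` property. A small sketch of that assembly, using a hypothetical helper `build_jdbc_url`; the host and database names are the ones used in this sample:

```python
def build_jdbc_url(host, database):
    """Build a SQL Server JDBC URL of the form used in this sample."""
    if not host or not database:
        raise ValueError("host and database are both required")
    return f"jdbc:sqlserver://{host};databaseName={database};"


url = build_jdbc_url("master-0.master-svc", "connector_test_db")
print(url)
```

Keeping the assembly in one place avoids a stray or missing semicolon when the same URL is reused for writes and reads.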

In [5]:
#Write from Spark to SQL table using the Apache Spark Connector for SQL Server and Azure SQL
print("Use Apache Spark Connector for SQL Server and Azure SQL to write to master SQL instance ")

servername = "jdbc:sqlserver://master-0.master-svc"
dbname = "connector_test_db"
url = servername + ";" + "databaseName=" + dbname + ";"

dbtable = "AdultCensus_test"
user = "connector_user"
password = "password123!#" # Please specify password here

#com.microsoft.sqlserver.jdbc.spark

try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", dbtable) \
        .option("user", user) \
        .option("password", password) \
        .save()
except Exception as error:
    # Spark typically raises Py4J or Spark-specific exceptions here, not ValueError
    print("Connector write failed", error)
else:
    # Report success only when save() did not raise
    print("Connector write(overwrite) succeeded  ")





Use Apache Spark Connector for SQL Server and Azure SQL to write to master SQL instance 
Connector write(overwrite) succeeded

In [6]:
#Use mode as append
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("append") \
        .option("url", url) \
        .option("dbtable", dbtable) \
        .option("user", user) \
        .option("password", password) \
        .save()
except Exception as error:
    # Spark typically raises Py4J or Spark-specific exceptions here, not ValueError
    print("Connector write failed", error)
else:
    # Report success only when save() did not raise
    print("Connector write(append) succeeded  ")

MSSQL Connector write(append) succeeded

In [7]:
#Read from SQL table using the Apache Spark Connector for SQL Server and Azure SQL
print("read data from SQL server table  ")
jdbcDF = spark.read \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", url) \
        .option("dbtable", dbtable) \
        .option("user", user) \
        .option("password", password).load()

jdbcDF.show(5)

read data from SQL server table  
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|education|education_num|    marital_status|       occupation| relationship| race|   sex|capital_gain|capital_loss|hours_per_week|native_country|income|
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States| <=50K|
| 38|         Private|215646|  HS-grad|            9|        

The connector also supports writes when the destination table contains computed columns. We will now write the same data to a table that has one additional column, computed from other columns.
- Write the dataframe to a SQL table with a computed column
- Read the SQL table into a Spark dataframe
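
To make the expected result concrete, the sketch below recomputes the extra column in plain Python. It assumes the computed column is `net_capital_gain = capital_gain - capital_loss`, as stated in the comment of the write cell that follows; the helper name is illustrative only.

```python
def net_capital_gain(row):
    """Mirror the assumed computed column: capital_gain - capital_loss."""
    return row["capital_gain"] - row["capital_loss"]


# Values taken from the first two rows shown in this sample's output
rows = [
    {"age": 39, "capital_gain": 2174, "capital_loss": 0},
    {"age": 50, "capital_gain": 0, "capital_loss": 0},
]
for row in rows:
    print(row["age"], net_capital_gain(row))
```

Because SQL Server derives the column itself, the dataframe being written does not carry a `net_capital_gain` column; it only appears when the table is read back.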

In [6]:
# Append to table with computed column: net_capital_gain = capital_gain - capital_loss

dbtable = "AdultCenses_computed_col"

try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("append") \
        .option("url", url) \
        .option("dbtable", dbtable) \
        .option("user", user) \
        .option("password", password) \
        .save()
except Exception as error:
    # Spark typically raises Py4J or Spark-specific exceptions here, not ValueError
    print("MSSQL Connector write failed", error)
else:
    # Report success only when save() did not raise
    print("MSSQL Connector write(append) succeeded  ")


MSSQL Connector write(append) succeeded

In [7]:
#Read from SQL table using MSSQL Connector
print("read data from SQL server table  ")
jdbcDF = spark.read \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", url) \
        .option("dbtable", dbtable) \
        .option("user", user) \
        .option("password", password).load()

jdbcDF.show(5)


read data from SQL server table  
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+----------------+
|age|       workclass|fnlwgt|education|education_num|    marital_status|       occupation| relationship| race|   sex|capital_gain|capital_loss|hours_per_week|native_country|income|net_capital_gain|
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+----------------+
| 39|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States| <=50K|            2174|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States| 

When schemaCheckEnabled is set to false, we can write to a destination table that has fewer columns than the dataframe.
- Write the dataframe to a SQL table with the strict dataframe and SQL table schema check disabled
- Read the SQL table into a Spark dataframe
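
One way to see what the strict check would object to is to compare the two column lists before writing. The sketch below is a plain-Python illustration, not the connector's actual check; `schema_diff` is a hypothetical helper and the column lists are shortened for readability.

```python
def schema_diff(df_columns, table_columns):
    """Return (only_in_dataframe, only_in_table), preserving input order."""
    table_set = set(table_columns)
    df_set = set(df_columns)
    only_in_dataframe = [c for c in df_columns if c not in table_set]
    only_in_table = [c for c in table_columns if c not in df_set]
    return only_in_dataframe, only_in_table


df_columns = ["age", "workclass", "capital_gain", "capital_loss"]
table_columns = ["workclass", "capital_gain", "capital_loss"]  # no age column

extra, missing = schema_diff(df_columns, table_columns)
print("only in dataframe:", extra)
print("only in table:", missing)
```

With a Spark session, the table's column list could be taken from the `columns` attribute of the dataframe returned by a connector read.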

In [8]:
# Append to table when schemaCheckEnabled set to false
# The age column does not exist in the SQL table

dbtable = "AdultCenses_schema_check"

try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("append") \
        .option("url", url) \
        .option("dbtable", dbtable) \
        .option("user", user) \
        .option("password", password) \
        .option("schemaCheckEnabled", "false") \
        .save()
except Exception as error:
    # Spark typically raises Py4J or Spark-specific exceptions here, not ValueError
    print("MSSQL Connector write failed", error)
else:
    # Report success only when save() did not raise
    print("MSSQL Connector write(append) succeeded  ")


MSSQL Connector write(append) succeeded

In [9]:
#Read from SQL table using MSSQL Connector
print("read data from SQL server table  ")
jdbcDF = spark.read \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", url) \
        .option("dbtable", dbtable) \
        .option("user", user) \
        .option("password", password).load()

jdbcDF.show(5)


read data from SQL server table  
+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+----------------+
|       workclass|fnlwgt|education|education_num|    marital_status|       occupation| relationship| race|   sex|capital_gain|capital_loss|hours_per_week|native_country|income|net_capital_gain|
+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+----------------+
|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States| <=50K|            2174|
|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States| <=50K|              

# (PART 2) Write to and Read from Data Pool External Tables in a Big Data Cluster
- Write the dataframe to a SQL external table in data pools in the Big Data Cluster
- Read the SQL external table into a Spark dataframe
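
The overwrite and append cells in this part differ only in the save mode, so they can be folded into one function. The sketch below is one possible shape, not part of the connector: `write_to_data_pool` is a hypothetical helper that works with any object exposing a Spark-style `write` attribute, catches any exception raised by `save()`, and reports success only when the write went through.

```python
CONNECTOR_FORMAT = "com.microsoft.sqlserver.jdbc.spark"


def write_to_data_pool(df, url, table, user, password, datasource, mode="append"):
    """Write df to a data pool external table; return True on success."""
    try:
        (
            df.write
            .format(CONNECTOR_FORMAT)
            .mode(mode)
            .option("url", url)
            .option("dbtable", table)
            .option("user", user)
            .option("password", password)
            # Option name as used in this sample's data pool cells
            .option("dataPoolDataSource", datasource)
            .save()
        )
    except Exception as error:
        # Spark typically raises Py4J or Spark-specific exceptions here,
        # so the handler catches broadly rather than a single type.
        print("Connector write failed:", error)
        return False
    print(f"Connector write({mode}) to data pool external table succeeded")
    return True
```

With the variables from this sample in scope, it would be called as `write_to_data_pool(df, url, "AdultCensus_DataPoolTable", user, password, "connector_ds", mode="overwrite")`.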

In [8]:
#Write from Spark to a data pool external table using the Apache Spark Connector for SQL Server and Azure SQL
print("Use MSSQL connector to write to master SQL instance ")

datapool_table = "AdultCensus_DataPoolTable"
datasource_name = "connector_ds"


try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", datapool_table) \
        .option("user", user) \
        .option("password", password) \
        .option("dataPoolDataSource", datasource_name) \
        .save()
except Exception as error:
    # Spark typically raises Py4J or Spark-specific exceptions here, not ValueError
    print("Connector write failed", error)
else:
    # Report success only when save() did not raise
    print("Connector write(overwrite) to data pool external table succeeded")


Use MSSQL connector to write to master SQL instance 
MSSQL Connector write(overwrite) to data pool external table succeeded

In [9]:
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("append") \
        .option("url", url) \
        .option("dbtable", datapool_table) \
        .option("user", user) \
        .option("password", password) \
        .option("dataPoolDataSource", datasource_name) \
        .save()
except Exception as error:
    # Spark typically raises Py4J or Spark-specific exceptions here, not ValueError
    print("Connector write failed", error)
else:
    # Report success only when save() did not raise
    print("Connector write(append) to data pool external table succeeded")

MSSQL Connector write(append) to data pool external table succeeded

In [10]:
#Read from SQL table using the Apache Spark Connector for SQL Server and Azure SQL
print("read data from SQL server table  ")
jdbcDF = spark.read \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", url) \
        .option("dbtable", datapool_table) \
        .option("user", user) \
        .option("password", password)\
        .load()

jdbcDF.show(5)

print("Connector read from data pool external table succeeded")

read data from SQL server table  
+---+----------------+------+------------+-------------+------------------+-------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|   education|education_num|    marital_status|   occupation| relationship| race|   sex|capital_gain|capital_loss|hours_per_week|native_country|income|
+---+----------------+------+------------+-------------+------------------+-------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 36|         Private| 99374|Some-college|           10|          Divorced| Craft-repair|Not-in-family|White|  Male|           0|           0|            40| United-States| <=50K|
| 27|         Private|248402|   Bachelors|           13|     Never-married| Tech-support|    Unmarried|Black|Female|           0|           0|            40| United-States| <=50K|
| 46|Self-emp-not-inc|277946|  Assoc-acdm|           12|         S