# Apache Spark Connector for SQL Server and Azure SQL with Integrated AD Auth

This sample shows how to use the Apache Spark Connector for SQL Server and Azure SQL with integrated Active Directory (AD) authentication, using a principal and keytab instead of a username and password.

## Prerequisites
- A SQL Server 2019 Big Data Cluster is deployed with AD.
- You have access to the AD controller to create the keytabs used in this sample.
- Download [AdultCensusIncome.csv](https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv) to your local machine and upload it to an HDFS folder named *spark_data*.
- The sample uses a SQL database named *spark_mssql_db* to create and update tables. The database must exist before the sample is run.
    

# Creating KeyTab file
The following section shows how to generate a principal and keytab. It assumes you have a SQL Server 2019 Big Data Cluster installed with a Windows AD controller for the domain AZDATA.LOCAL. One of the users is testusera1@AZDATA.LOCAL, and this user is a member of the Domain Admins group.

##  Create KeyTab file using ktpass
1. Log in to the Windows AD controller with the testusera1 credentials.
2. Open a command prompt in Administrator mode.
3. Use ktpass to create a keytab. See the [ktpass documentation](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/ktpass) for details.

```sh
ktpass -out testusera1.keytab -mapUser testusera1@AZDATA.LOCAL -pass <testusera1 password> -mapOp set +DumpSalt -crypto AES256-SHA1 -ptype KRB5_NT_PRINCIPAL -princ testusera1@AZDATA.LOCAL
```

Note that the principal name in ktpass is case sensitive. The command above generates a keytab file named testusera1.keytab. Transfer this file to an HDFS folder in the Big Data Cluster. In this sample the file is transferred to /user/testusera1/testusera1.keytab.

## Create KeyTab file using ktutil

If you are on a Linux machine, ktutil can be used as follows to create the keytab. Note that your Linux machine should be connected to the domain controller.

``` sh
ktutil
ktutil : add_entry -password -p testusera1@AZDATA.LOCAL -k 1 -e arcfour-hmac-md5
Password for testusera1@AZDATA.LOCAL:
ktutil : add_entry -password -p testusera1@AZDATA.LOCAL -k 1 -e des-cbc-md4
ktutil : wkt testusera1.keytab 
```

``` sh
# Check that the keytab was generated properly. Any error means the keytab was not generated correctly.
kinit -kt testusera1.keytab  testusera1@AZDATA.LOCAL
```

Upload the keytab to HDFS so that Spark can use it:

```sh
hadoop fs -mkdir -p /user/testusera1/
hadoop fs -copyFromLocal -f testusera1.keytab  /user/testusera1/testusera1.keytab
```



 

# Create a database user

``` sql
IF NOT EXISTS (select name from sys.server_principals where name='AZDATA.LOCAL\testusera1')
BEGIN
    CREATE LOGIN [AZDATA.LOCAL\testusera1] FROM WINDOWS
END

ALTER SERVER ROLE dbcreator ADD MEMBER [AZDATA.LOCAL\testusera1]
GRANT VIEW SERVER STATE to  [AZDATA.LOCAL\testusera1]

-- Create a database named "spark_mssql_db"
IF NOT EXISTS (SELECT * FROM sys.databases WHERE name = N'spark_mssql_db')
                CREATE DATABASE spark_mssql_db
```

# Create Data Pool user

``` sql
-- To create external tables in data pools
GRANT ALTER ANY EXTERNAL DATA SOURCE TO [aris\testusera1];

-- To create external table
GRANT CREATE TABLE TO [aris\testusera1];
GRANT ALTER ANY SCHEMA TO [aris\testusera1];

ALTER ROLE [db_datareader] ADD MEMBER [aris\testusera1]
ALTER ROLE [db_datawriter] ADD MEMBER [aris\testusera1]
```

``` sql
CREATE EXTERNAL DATA SOURCE connector_ds WITH (LOCATION = 'sqldatapool://controller-svc/default');
EXECUTE('USE spark_mssql_db; CREATE EXTERNAL TABLE [dummy3] ([number] int, [word] nvarchar(2048)) WITH (DATA_SOURCE = connector_ds, DISTRIBUTION = ROUND_ROBIN)')

-- Create a login in the data pools and grant this user the required permissions
EXECUTE('USE spark_mssql_db; CREATE LOGIN [aris\testusera1] FROM WINDOWS') AT DATA_SOURCE connector_ds;

EXECUTE('USE spark_mssql_db; CREATE USER [aris\testusera1]; ALTER ROLE [db_datareader] ADD MEMBER [aris\testusera1]; ALTER ROLE [db_datawriter] ADD MEMBER [aris\testusera1];') AT DATA_SOURCE connector_ds;
```

# Configure Spark application to point to the key tab file
Once the keytab is created and uploaded to HDFS (/user/testusera1/testusera1.keytab), configure Spark to use it.
Note the setting "spark.files" : "/user/testusera1/testusera1.keytab". With this configuration, the Spark driver distributes the file to all executors.

Run the cell below to start the spark application.


```
%%configure -f
{"conf": {
    "spark.files": "/user/testusera1/testusera1.keytab",
    "spark.executor.memory": "4g",
    "spark.driver.memory": "4g",
    "spark.executor.cores": 2,
    "spark.driver.cores": 1,
    "spark.executor.instances": 4
    }
}
```
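The configuration body above is plain JSON, so it can also be generated rather than typed by hand. The following is a minimal sketch, assuming you want to reuse the same settings with a different keytab path; `build_session_conf` is a hypothetical helper name and is not part of the connector or of Spark.

```python
import json


def build_session_conf(keytab_hdfs_path, executor_memory="4g", driver_memory="4g",
                       executor_cores=2, driver_cores=1, executor_instances=4):
    """Return a JSON body for a %%configure cell (hypothetical helper)."""
    conf = {
        # spark.files lists files that Spark ships to every executor
        "spark.files": keytab_hdfs_path,
        "spark.executor.memory": executor_memory,
        "spark.driver.memory": driver_memory,
        "spark.executor.cores": executor_cores,
        "spark.driver.cores": driver_cores,
        "spark.executor.instances": executor_instances,
    }
    return json.dumps({"conf": conf}, indent=4)


# Paste the printed JSON under "%%configure -f" in the notebook cell
print(build_session_conf("/user/testusera1/testusera1.keytab"))
```

The default values mirror the ones used in this sample; adjust them to the resources available in your cluster.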



# Read CSV into a data frame
In this step we read the CSV into a data frame. This data frame is then written to a SQL table using the Apache Spark Connector for SQL Server and Azure SQL.




``` python
#spark = SparkSession.builder.getOrCreate()
sc.setLogLevel("INFO")

# Read a file; it is written to the SQL table in a later step
datafile = "/spark_data/AdultCensusIncome.csv"
df = spark.read.format('csv').options(header='true', inferSchema='true', ignoreLeadingWhiteSpace='true', ignoreTrailingWhiteSpace='true').load(datafile)
df.show(5)

# Process this data. Very simple data cleanup step: replace "-" with "_" in column names
columns_new = [col.replace("-", "_") for col in df.columns]
df = df.toDF(*columns_new)
df.show(5)
```
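The column cleanup step can be tried without a Spark session. The sketch below applies the same replacement to a plain list of names and adds a duplicate check, which is an assumption of this example and not something the sample itself does; `sanitize_column_names` is a hypothetical helper name.

```python
def sanitize_column_names(names):
    """Replace '-' with '_' in each column name and reject collisions."""
    renamed = [name.replace("-", "_") for name in names]
    # Two different source names could map to the same cleaned name
    if len(set(renamed)) != len(renamed):
        raise ValueError("renaming produced duplicate column names")
    return renamed


sample = ["age", "education-num", "marital-status", "hours-per-week"]
print(sanitize_column_names(sample))
# ['age', 'education_num', 'marital_status', 'hours_per_week']
```

In the notebook, the equivalent call would be `df = df.toDF(*sanitize_column_names(df.columns))`.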



```
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|education|education-num|    marital-status|       occupation| relationship| race|   sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States| <=50K|
| 38|         Private|215646|  HS-grad|            9|          Divorced|Handlers-cleaners|Not-i
```

# (Part 1) Write and read to/from SQL table (using Integrated Auth)
- Write the dataframe to a SQL table on the master instance
- Read the SQL table into a Spark dataframe

In both scenarios we use integrated auth with a principal and keytab file rather than the user's username and password.
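The cells in this part assemble the JDBC URL by string concatenation. As a rough sketch, the same URL can be produced by a small function, which keeps the Kerberos properties in one place; `build_kerberos_jdbc_url` is a hypothetical helper name, while the property names and values are the ones used in this sample.

```python
def build_kerberos_jdbc_url(host, port, database):
    """Assemble a SQL Server JDBC URL that requests Kerberos integrated auth."""
    properties = {
        "databaseName": database,
        "integratedSecurity": "true",
        "authenticationScheme": "JavaKerberos",
    }
    # Each property is appended as ;key=value, followed by a trailing semicolon
    suffix = "".join(f";{key}={value}" for key, value in properties.items())
    return f"jdbc:sqlserver://{host}:{port}{suffix};"


url = build_kerberos_jdbc_url("master-p-svc", 1433, "spark_mssql_db")
print(url)
# jdbc:sqlserver://master-p-svc:1433;databaseName=spark_mssql_db;integratedSecurity=true;authenticationScheme=JavaKerberos;
```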

``` python
# Write from Spark to a SQL table using the Apache Spark Connector for SQL Server and Azure SQL
print("Apache Spark Connector for SQL Server and Azure SQL write(overwrite) start ")

servername = "jdbc:sqlserver://master-p-svc:1433"
dbname = "spark_mssql_db"
security_spec = ";integratedSecurity=true;authenticationScheme=JavaKerberos;"
url = servername + ";" + "databaseName=" + dbname + security_spec

dbtable = "AdultCensus_test"
principal = "testusera1@AZDATA.LOCAL"
keytab = "/user/testusera1/testusera1.keytab"

try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", dbtable) \
        .option("principal", principal) \
        .option("keytab", keytab) \
        .save()
    print("Apache Spark Connector for SQL Server and Azure SQL write(overwrite) done")
except Exception as error:
    # Failures on the JVM side do not surface as ValueError, so catch Exception
    print("Apache Spark Connector for SQL Server and Azure SQL write(overwrite) failed", error)
```


``` python
# Read from the SQL table using the Apache Spark Connector for SQL Server and Azure SQL
print("Apache Spark Connector for SQL Server and Azure SQL read start ")
jdbcDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", dbtable) \
    .option("principal", principal) \
    .option("keytab", keytab) \
    .load()

jdbcDF.show(5)

print("Apache Spark Connector for SQL Server and Azure SQL read done")
```

```
read data from SQL server table  
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|education|education_num|    marital_status|       occupation| relationship| race|   sex|capital_gain|capital_loss|hours_per_week|native_country|income|
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States| <=50K|
| 38|         Private|215646|  HS-grad|            9|        
```

# (Part 2) Write and read to/from Data Pools (using Integrated Auth)
- Write the dataframe to a SQL external table in the Data Pools of the Big Data Cluster
- Read the SQL external table into a Spark dataframe


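The data pool user and the external data source `connector_ds` used here are created as described in the "Create Data Pool user" section above.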

``` python
# Write from Spark to data pools using the Apache Spark Connector for SQL Server and Azure SQL
print("Apache Spark Connector for SQL Server and Azure SQL write(overwrite) start ")

servername = "jdbc:sqlserver://master-p-svc:1433"
dbname = "spark_mssql_db"
security_spec = ";integratedSecurity=true;authenticationScheme=JavaKerberos;"
url = servername + ";" + "databaseName=" + dbname + security_spec

datapool_table = "AdultCensus_DataPoolTable"
principal = "testusera1@AZDATA.LOCAL"
# Same HDFS path the keytab was uploaded to and listed under spark.files
keytab = "/user/testusera1/testusera1.keytab"

datasource_name = "connector_ds"

try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", datapool_table) \
        .option("principal", principal) \
        .option("keytab", keytab) \
        .option("dataPoolDataSource", datasource_name) \
        .save()
    print("Apache Spark Connector for SQL Server and Azure SQL write(overwrite) done")
except Exception as error:
    # Failures on the JVM side do not surface as ValueError, so catch Exception
    print("Apache Spark Connector for SQL Server and Azure SQL write(overwrite) failed", error)
```

``` python
# Read from the data pool external table using the Apache Spark Connector for SQL Server and Azure SQL
print("Apache Spark Connector for SQL Server and Azure SQL read data pool external table start ")
jdbcDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", datapool_table) \
    .option("principal", principal) \
    .option("keytab", keytab) \
    .load()

jdbcDF.show(5)

print("Apache Spark Connector for SQL Server and Azure SQL read from data pool external table succeeded")
```
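The read and write calls in this sample repeat the same handful of options, which makes it easy for one of them to drift (for example, a different table name on the read side). One possible way to keep them aligned, sketched here under the assumption that all options are plain strings, is to build them once as a dictionary; `connector_options` is a hypothetical helper name and is not part of the connector.

```python
def connector_options(url, table, principal, keytab, data_pool_source=None):
    """Collect the options this sample passes to the connector (hypothetical helper)."""
    options = {
        "url": url,
        "dbtable": table,
        "principal": principal,
        "keytab": keytab,
    }
    # Only the data pool write in this sample sets dataPoolDataSource
    if data_pool_source is not None:
        options["dataPoolDataSource"] = data_pool_source
    return options


write_opts = connector_options(
    "jdbc:sqlserver://master-p-svc:1433;databaseName=spark_mssql_db;"
    "integratedSecurity=true;authenticationScheme=JavaKerberos;",
    "AdultCensus_DataPoolTable",
    "testusera1@AZDATA.LOCAL",
    "/user/testusera1/testusera1.keytab",
    data_pool_source="connector_ds",
)
print(sorted(write_opts))

# In the notebook, the dictionary could be unpacked with options(**...):
# df.write.format("com.microsoft.sqlserver.jdbc.spark") \
#     .mode("overwrite").options(**write_opts).save()
```

For the read side, the same function can be called without `data_pool_source`, and the result passed to `spark.read.format(...).options(**read_opts).load()`.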