In [1]:
from pyspark.sql import SparkSession
import os

## Create spark session

Spark uses JDBC to connect to the MySQL server, so the MySQL connector jar must be added to the Spark context as a dependency.

Note the following line in the code below:

```text
# add local jar file
config("spark.jars","/home/pengfei/git/RecetteConstance/app/lib/mysql-connector-java-8.0.30.jar")
```
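
As a rough sketch, the driver can be supplied either as local jar files (`spark.jars`, a comma-separated list of paths) or as Maven coordinates (`spark.jars.packages`), which Spark resolves at startup. The helper name `driver_config` is hypothetical, and the coordinate `mysql:mysql-connector-java:8.0.30` is assumed to correspond to the jar used in this notebook.

```python
def driver_config(jar_paths=None, maven_coords=None):
    """Return the (key, value) pair to pass to SparkSession.builder.config().

    Hypothetical helper: exactly one of jar_paths / maven_coords must be given.
    """
    if bool(jar_paths) == bool(maven_coords):
        raise ValueError("give exactly one of jar_paths or maven_coords")
    if jar_paths:
        # spark.jars expects a comma-separated list of jar locations
        return "spark.jars", ",".join(jar_paths)
    # spark.jars.packages expects comma-separated groupId:artifactId:version
    return "spark.jars.packages", ",".join(maven_coords)


key, value = driver_config(
    jar_paths=["/home/pengfei/git/RecetteConstance/app/lib/mysql-connector-java-8.0.30.jar"]
)
# Usage: SparkSession.builder.config(key, value)
```

Calling `driver_config(maven_coords=["mysql:mysql-connector-java:8.0.30"])` instead would return the `spark.jars.packages` form, which avoids hard-coding a local path.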

In [2]:
local=True
if local:
    spark=(SparkSession.builder.master("local[4]")
           .appName("sparkMysql")
           .config("spark.jars","/home/pengfei/git/RecetteConstance/app/lib/mysql-connector-java-8.0.30.jar")
           .getOrCreate())
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("RepartitionCSV") \
                      .config("spark.kubernetes.container.image",os.environ["IMAGE_NAME"]) \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config("spark.kubernetes.driver.pod.name", os.environ["POD_NAME"]) \
                      .config('spark.jars.packages','mysql:mysql-connector-java:8.0.30') \
                      .getOrCreate()

23/09/08 15:54:58 WARN Utils: Your hostname, pengfei-Virtual-Machine resolves to a loopback address: 127.0.1.1; using 10.50.2.80 instead (on interface eth0)
23/09/08 15:54:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
23/09/08 15:54:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [3]:
columns = ["id", "name","age","gender"]
data = [(1, "James",30,"M"), (2, "Ann",40,"F"),
    (3, "Jeff",41,"M"),(4, "Jennifer",20,"F")]

In [5]:
sampleDF = spark.createDataFrame(data,schema=columns)
sampleDF.show()

                                                                                

+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  1|   James| 30|     M|
|  2|     Ann| 40|     F|
|  3|    Jeff| 41|     M|
|  4|Jennifer| 20|     F|
+---+--------+---+------+


In [14]:
sampleDF.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- gender: string (nullable = true)


## Write dataframe to mysql server

To write a DataFrame to the MySQL server, we need four pieces of information:
- **database name**: the database to write to
- **table name**: the table that will hold the data
- **user id**: the user id used to connect to the database
- **password**: the password used to connect to the database

The write parallelism (the number of JDBC connections) depends on the number of partitions of the DataFrame.
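
Since each partition opens its own JDBC connection, one way to limit the load on the MySQL server is to reduce the partition count before writing. The sketch below uses a hypothetical helper, `write_with_max_connections`, that relies only on standard DataFrame methods (`rdd.getNumPartitions`, `coalesce`, `write`). Spark's JDBC writer also accepts a `numPartitions` option that applies a similar cap.

```python
def write_with_max_connections(df, jdbc_options, max_connections, mode="append"):
    """Write df over JDBC using at most max_connections parallel connections.

    df is a Spark DataFrame; jdbc_options is a dict with keys such as
    "driver", "url", "dbtable", "user" and "password".
    Hypothetical helper, shown for illustration only.
    """
    if max_connections < 1:
        raise ValueError("max_connections must be >= 1")
    # One JDBC connection is opened per partition, so cap the partition count.
    if df.rdd.getNumPartitions() > max_connections:
        df = df.coalesce(max_connections)  # coalesce avoids a full shuffle
    df.write.format("jdbc").options(**jdbc_options).mode(mode).save()
    return df
```

For example, `write_with_max_connections(sampleDF, opts, 2)` would write with at most two connections, where `opts` holds the same driver, URL, table, and credential options as the write cell.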

In [6]:
dbName="constance"
tabName = "employee"
uid="recette"
pwd = "casd2023"

In [7]:
sampleDF.write \
  .format("jdbc") \
  .option("driver","com.mysql.cj.jdbc.Driver") \
  .option("url", f"jdbc:mysql://localhost:3306/{dbName}") \
  .option("dbtable", f"{tabName}") \
  .option("user", f"{uid}") \
  .option("password", f"{pwd}") \
  .save()

                                                                                

### Some Useful options 

- **mode("overwrite")**: By default, drops the table if it already exists and re-creates it without its indexes.
- **mode("append")**: Keeps the existing rows and appends the new rows to the existing MySQL table.
- **option("truncate","true")**: Used together with `mode("overwrite")`, truncates the existing table instead of dropping and re-creating it, so the table definition and its indexes are retained.
- **option("isolationLevel", "READ_COMMITTED")**: Sets the transaction isolation level used for the write. Spark's JDBC writer uses READ_UNCOMMITTED by default; you can change it with this option.
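
As a small illustration, the overwrite-related settings can be collected in a dictionary and passed with `options(**opts)`. The helper `overwrite_options` is hypothetical; the connection values are the ones used earlier in this notebook, and the credentials are left out.

```python
def overwrite_options(base_options, keep_table_definition=True):
    """Return a copy of base_options prepared for mode("overwrite").

    Hypothetical helper, shown for illustration only.
    """
    opts = dict(base_options)  # do not mutate the caller's dict
    # "truncate" only has an effect together with mode("overwrite")
    opts["truncate"] = "true" if keep_table_definition else "false"
    return opts


base = {
    "driver": "com.mysql.cj.jdbc.Driver",
    "url": "jdbc:mysql://localhost:3306/constance",
    "dbtable": "employee",
}
opts = overwrite_options(base)
# Usage (user and password options omitted here):
# sampleDF.write.format("jdbc").options(**opts).mode("overwrite").save()
```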

## Read the mysql server table into a dataframe

In [8]:
readDF= spark.read \
  .format("jdbc") \
  .option("driver","com.mysql.cj.jdbc.Driver") \
  .option("url", f"jdbc:mysql://localhost:3306/{dbName}") \
  .option("dbtable", f"{tabName}") \
  .option("user", f"{uid}") \
  .option("password", f"{pwd}") \
  .load()

In [9]:
readDF.show()

+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  4|Jennifer| 20|     F|
|  3|    Jeff| 41|     M|
|  2|     Ann| 40|     F|
|  1|   James| 30|     M|
+---+--------+---+------+


### Read data with projection/predicate pushdown

We can select specific columns and filter rows in the query so that less data is loaded into Spark, which improves performance.

Some important points:
- **You can use either the `dbtable` or the `query` option, but not both at the same time.**
- When using the `query` option, the value must be a full `select` statement; it can't be a bare table name.
- **When using the `query` option, you can't use the `partitionColumn` option; use `dbtable` if you need a partitioned read.**
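
The `dbtable` option accepts anything that is valid in a SQL `FROM` clause, so a query can also be passed through `dbtable` as a parenthesized subquery with an alias. The sketch below shows one way to build that string; the helper name `as_dbtable` and the alias `t` are assumptions made for illustration.

```python
def as_dbtable(sql, alias="t"):
    """Wrap a SELECT statement so it can be used as the `dbtable` option.

    Hypothetical helper, shown for illustration only.
    """
    sql = sql.strip().rstrip(";")
    if not sql.lower().startswith("select"):
        raise ValueError("expected a SELECT statement")
    # MySQL requires every derived table to have an alias
    return f"({sql}) AS {alias}"


dbtable = as_dbtable("select id, gender from employee where age>23")
# Usage: .option("dbtable", dbtable) in place of .option("query", ...)
```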

In [26]:
query = "select id, gender from employee where age>23"

In [27]:
pushDownDF= spark.read \
  .format("jdbc") \
  .option("driver","com.mysql.cj.jdbc.Driver") \
  .option("url", f"jdbc:mysql://localhost:3306/{dbName}") \
  .option("query", f"{query}") \
  .option("user", f"{uid}") \
  .option("password", f"{pwd}") \
  .load()

In [28]:
pushDownDF.show()

+---+------+
| id|gender|
+---+------+
|  3|     M|
|  2|     F|
|  1|     M|
+---+------+


### Read mysql table in parallel

Two useful options if you want to read the table in parallel:

- Use option **numPartitions** to set the maximum number of partitions, which is also the maximum number of concurrent JDBC connections. For reads, it only takes effect together with `partitionColumn`, `lowerBound`, and `upperBound`.
- Use option **fetchsize** to specify how many rows to fetch per round trip. The default depends on the JDBC driver (Oracle, for example, defaults to 10 rows).

The example below asks for 2 partitions (2 JDBC connections in parallel) and fetches 20 rows at a time.

In [38]:
query2="select * from employee"
paraReadDF = spark.read \
  .format("jdbc") \
  .option("driver","com.mysql.cj.jdbc.Driver") \
  .option("url", f"jdbc:mysql://localhost:3306/{dbName}") \
  .option("query", f"{query2}") \
  .option("numPartitions","2") \
  .option("fetchsize","20") \
  .option("user", f"{uid}") \
  .option("password", f"{pwd}") \
  .load()

In [39]:
paraReadDF.show()


+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  4|Jennifer| 20|     F|
|  3|    Jeff| 41|     M|
|  2|     Ann| 40|     F|
|  1|   James| 30|     M|
+---+--------+---+------+


In [37]:
paraReadDF.rdd.getNumPartitions()

1

The DataFrame has only 1 partition, so the read was not parallel. Let's try with a bigger dataset.
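
For reference, a partitioned JDBC read needs `partitionColumn`, `lowerBound`, and `upperBound` in addition to `numPartitions`, and these go with `dbtable` rather than `query`. The sketch below approximates how such bounds are turned into one `WHERE` clause per partition. The helper `partition_predicates` is hypothetical and simplifies Spark's internal logic to integer bounds; the bounds 1 and 5 are assumed from the sample `employee` data.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Approximate the per-partition WHERE clauses of a partitioned JDBC read.

    Simplified illustration for integer bounds; not Spark's actual code.
    Returns [None] when the read would use a single partition.
    """
    num_partitions = min(num_partitions, upper - lower)
    if num_partitions <= 1:
        return [None]  # single partition: the whole table, no predicate
    stride = (upper - lower) // num_partitions
    boundaries = [lower + stride * i for i in range(1, num_partitions)]
    # The bounds only shape the strides; rows outside them are still read
    # by the first and last partitions.
    predicates = [f"{column} < {boundaries[0]} OR {column} IS NULL"]
    for lo, hi in zip(boundaries, boundaries[1:]):
        predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    predicates.append(f"{column} >= {boundaries[-1]}")
    return predicates


read_options = {
    "dbtable": "employee",      # partitioned reads use dbtable, not query
    "partitionColumn": "id",    # must be a numeric, date, or timestamp column
    "lowerBound": "1",
    "upperBound": "5",
    "numPartitions": "2",
    "fetchsize": "20",
}
predicates = partition_predicates("id", 1, 5, 2)
# Usage: spark.read.format("jdbc").options(**read_options) plus the driver,
# url, user and password options, followed by .load()
```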


In [40]:
sfPath="/home/pengfei/data_set/sf_fire/sf_fire_snappy.parquet"
df = spark.read.parquet(sfPath)

                                                                                

In [41]:
df.show()

23/09/08 16:44:23 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.




+----------+------+--------------+--------------------+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+-----------------+---------+-----------+----+----------------+--------+-------------+-------+--------------------+--------------+--------------+--------------------------+----------------------+------------------+--------------------+----------------+--------------------+
|CallNumber|UnitID|IncidentNumber|            CallType|  CallDate| WatchDate|        ReceivedDtTm|           EntryDtTm|        DispatchDtTm|        ResponseDtTm|         OnSceneDtTm|       TransportDtTm|        HospitalDtTm|CallFinalDisposition|       AvailableDtTm|             Address|         City|ZipcodeofIncident|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|       CallTypeGroup|NumberofAla

                                                                                

In [42]:
df.rdd.getNumPartitions()

5

In [None]:
tabName = "sf_fire"
df.write \
  .format("jdbc") \
  .option("driver","com.mysql.cj.jdbc.Driver") \
  .option("url", f"jdbc:mysql://localhost:3306/{dbName}") \
  .option("dbtable", f"{tabName}") \
  .option("user", f"{uid}") \
  .option("password", f"{pwd}") \
  .save()

