In [2]:
from pyspark.sql import SparkSession
import os

## Create spark session

Spark uses JDBC to connect to the MySQL server, so the MySQL JDBC driver jar must be added to the Spark context.

Note the following line in the session configuration below:

```text
# add local jar file
config("spark.jars","/home/pengfei/git/RecetteConstance/app/lib/mysql-connector-java-8.0.30.jar")
```
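When the jar is not available on the local disk, Spark can instead download the driver from a Maven repository through `spark.jars.packages`. The sketch below picks one of the two settings. The helper name `driver_conf` and the Maven coordinate `mysql:mysql-connector-java:8.0.30` are assumptions chosen to match the jar above; the notebook itself does not use them.

```python
import os


def driver_conf(jar_path=None, maven_coordinate="mysql:mysql-connector-java:8.0.30"):
    """Return the (key, value) Spark setting that provides the MySQL JDBC driver.

    Hypothetical helper: prefers a local jar when the file exists and
    otherwise falls back to a Maven coordinate (an assumed default).
    """
    if jar_path and os.path.isfile(jar_path):
        # local jar shipped with the application
        return ("spark.jars", jar_path)
    # let Spark resolve the driver from Maven at session start-up
    return ("spark.jars.packages", maven_coordinate)


# The missing path below forces the Maven fallback.
key, value = driver_conf("/no/such/dir/mysql-connector.jar")
print(key, value)
```

The pair can be passed straight to the builder, for example `SparkSession.builder.config(*driver_conf(path))`.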

In [3]:
local=True
if local:
    spark=(SparkSession.builder.master("local[4]") \
                  .appName("sparkMysql")\
                  .config("spark.jars","/home/pengfei/git/RecetteConstance/app/lib/mysql-connector-java-8.0.30.jar") \
                   .getOrCreate())
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("RepartitionCSV") \
                      .config("spark.kubernetes.container.image",os.environ["IMAGE_NAME"]) \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config("spark.kubernetes.driver.pod.name", os.environ["POD_NAME"]) \
                      .config('spark.jars.packages','mysql:mysql-connector-java:8.0.30') \
                      .getOrCreate()

23/09/11 10:01:31 WARN Utils: Your hostname, pengfei-Virtual-Machine resolves to a loopback address: 127.0.1.1; using 10.50.2.80 instead (on interface eth0)
23/09/11 10:01:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
23/09/11 10:01:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [4]:
columns = ["id", "name","age","gender"]
data = [(1, "James",30,"M"), (2, "Ann",40,"F"),
    (3, "Jeff",41,"M"),(4, "Jennifer",20,"F")]

In [5]:
sampleDF = spark.createDataFrame(data,schema=columns)
sampleDF.show()

                                                                                

+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  1|   James| 30|     M|
|  2|     Ann| 40|     F|
|  3|    Jeff| 41|     M|
|  4|Jennifer| 20|     F|
+---+--------+---+------+


In [6]:
sampleDF.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- gender: string (nullable = true)


## Write dataframe to mysql server

To write a DataFrame to the MySQL server, we need four pieces of information:
- **database name**: the database to write to
- **table name**: the table that will hold the data
- **user id**: the user id of the connection credentials
- **password**: the password of the connection credentials

The write parallelism (the number of concurrent JDBC connections) depends on the number of partitions of the DataFrame.
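The four items map directly onto JDBC options. Below is a minimal sketch, assuming a hypothetical helper `jdbc_write_options` and placeholder credentials. The optional `max_connections` argument relies on Spark's `numPartitions` option, which on write makes Spark coalesce the DataFrame so that no more than that many partitions (and connections) are used.

```python
def jdbc_write_options(host, port, db_name, table, user, password,
                       driver="com.mysql.cj.jdbc.Driver", max_connections=None):
    """Build the option dict for df.write.format("jdbc") (hypothetical helper)."""
    options = {
        "url": f"jdbc:mysql://{host}:{port}/{db_name}",  # database name is part of the URL
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
    }
    if max_connections is not None:
        if max_connections < 1:
            raise ValueError("max_connections must be at least 1")
        # Spark coalesces to this many partitions before writing
        options["numPartitions"] = str(max_connections)
    return options


# Placeholder credentials, for illustration only
opts = jdbc_write_options("localhost", 3306, "constance", "employee",
                          "some_user", "some_password", max_connections=2)
print(opts["url"])

# With a live session and server this could be used as:
# sampleDF.write.format("jdbc").options(**opts).mode("append").save()
```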

In [29]:
# MySQL connection config
host = "localhost"
port = 3306
dbName="constance"
mysqlUrl = f"jdbc:mysql://{host}:{port}/{dbName}"
driverName = "com.mysql.cj.jdbc.Driver"
tabName = "employee"
uid="recette"
pwd = "casd2023"

In [12]:
sampleDF.write \
  .format("jdbc") \
  .option("driver",driverName) \
  .option("url", mysqlUrl) \
  .option("dbtable", f"{tabName}") \
  .option("user", f"{uid}") \
  .option("password", f"{pwd}") \
  .save()

AnalysisException: Table or view 'employee' already exists. SaveMode: ErrorIfExists.

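### Some useful options

The write above fails because the default save mode is `ErrorIfExists`. The following options change that behavior:

- **mode("overwrite")**: by default, drops the table if it already exists and re-creates it, so existing indexes are lost.
- **mode("append")**: keeps the existing rows and appends the new rows to the existing MySQL table.
- **option("truncate","true")**: used with `mode("overwrite")`, truncates the table instead of dropping and re-creating it, so the table definition and its indexes are kept.
- **option("isolationLevel", "READ_COMMITTED")**: sets the transaction isolation level of the write connection. Spark's JDBC writer uses `READ_UNCOMMITTED` by default. (The `mssqlIsolationLevel` option belongs to the Spark connector for SQL Server and does not apply to MySQL.)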
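A minimal sketch of how such options could be combined for an overwrite that keeps the table definition. `overwrite_flags` is a hypothetical helper; it uses `isolationLevel`, the option name of Spark's generic JDBC writer, and the list of accepted levels is the one given in Spark's JDBC data source documentation.

```python
# Isolation levels accepted by Spark's JDBC "isolationLevel" option
ISOLATION_LEVELS = {"NONE", "READ_COMMITTED", "READ_UNCOMMITTED",
                    "REPEATABLE_READ", "SERIALIZABLE"}


def overwrite_flags(truncate=True, isolation_level=None):
    """Return extra write options for mode("overwrite") (hypothetical helper)."""
    extra = {}
    if truncate:
        # TRUNCATE TABLE instead of DROP + CREATE
        extra["truncate"] = "true"
    if isolation_level is not None:
        level = isolation_level.upper()
        if level not in ISOLATION_LEVELS:
            raise ValueError(f"unknown isolation level: {isolation_level}")
        extra["isolationLevel"] = level
    return extra


extra = overwrite_flags(truncate=True, isolation_level="read_committed")
print(extra)

# With a live session and server, for example:
# sampleDF.write.format("jdbc").option("driver", driverName).option("url", mysqlUrl) \
#     .option("dbtable", tabName).option("user", uid).option("password", pwd) \
#     .options(**extra).mode("overwrite").save()
```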

## Read the mysql server table into a dataframe

In [8]:
readDF= spark.read \
  .format("jdbc") \
  .option("driver","com.mysql.cj.jdbc.Driver") \
  .option("url", f"jdbc:mysql://localhost:3306/{dbName}") \
  .option("dbtable", f"{tabName}") \
  .option("user", f"{uid}") \
  .option("password", f"{pwd}") \
  .load()

In [9]:
readDF.show()

+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  4|Jennifer| 20|     F|
|  3|    Jeff| 41|     M|
|  2|     Ann| 40|     F|
|  1|   James| 30|     M|
+---+--------+---+------+


### Read data with projection/predicate pushdown

We can select specific columns and filter rows in the query so that less data is loaded into Spark, which improves performance.

Some important points:
- **You can use either the `dbtable` or the `query` option, but not both at the same time.**
- When using the `query` option, the value must be a full `select` statement; it can't be a plain table name.
- **When using the `query` option, you can't use the `partitionColumn` option.** If you need a partitioned read of a query, pass it as an aliased subquery through `dbtable` instead.
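A query can still be used where `dbtable` is required by wrapping it as an aliased subquery. Here is a small sketch, with a hypothetical helper `query_as_dbtable` and an arbitrary alias `t`:

```python
def query_as_dbtable(query, alias="t"):
    """Wrap a select statement so it can be passed to the `dbtable` option."""
    cleaned = query.strip().rstrip(";")
    if not cleaned.lower().startswith("select"):
        raise ValueError("expected a select statement")
    # MySQL requires every derived table to have an alias
    return f"({cleaned}) AS {alias}"


dbtable = query_as_dbtable("select id, gender from employee where age>23")
print(dbtable)  # (select id, gender from employee where age>23) AS t
```

The result would be passed as `.option("dbtable", dbtable)` in place of `.option("query", ...)`.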

In [26]:
query = "select id, gender from employee where age>23"

In [27]:
pushDownDF= spark.read \
  .format("jdbc") \
  .option("driver","com.mysql.cj.jdbc.Driver") \
  .option("url", f"jdbc:mysql://localhost:3306/{dbName}") \
  .option("query", f"{query}") \
  .option("user", f"{uid}") \
  .option("password", f"{pwd}") \
  .load()

In [28]:
pushDownDF.show()

+---+------+
| id|gender|
+---+------+
|  3|     M|
|  2|     F|
|  1|     M|
+---+------+


### Read mysql table in parallel

Two useful options when reading a table in parallel:

- Use the **numPartitions** option to set the maximum number of partitions used to read the MySQL table. It also determines the maximum number of concurrent JDBC connections. 
- Use the **fetchsize** option to specify how many rows to fetch per round trip. Spark's default is 0, which leaves the choice to the JDBC driver.

The example below asks for 2 partitions (2 JDBC connections in parallel) and a fetch size of 20 rows.

In [38]:
query2="select * from employee"
paraReadDF = spark.read \
  .format("jdbc") \
  .option("driver","com.mysql.cj.jdbc.Driver") \
  .option("url", f"jdbc:mysql://localhost:3306/{dbName}") \
  .option("query", f"{query2}") \
  .option("numPartitions","2") \
  .option("fetchsize","20") \
  .option("user", f"{uid}") \
  .option("password", f"{pwd}")\
  .load()

In [39]:
paraReadDF.show()


+---+--------+---+------+
| id|    name|age|gender|
+---+--------+---+------+
|  4|Jennifer| 20|     F|
|  3|    Jeff| 41|     M|
|  2|     Ann| 40|     F|
|  1|   James| 30|     M|
+---+--------+---+------+


In [37]:
paraReadDF.rdd.getNumPartitions()

1

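The partition number is 1, so the read was not parallel. `numPartitions` alone does not split a JDBC read: Spark also needs the `partitionColumn`, `lowerBound` and `upperBound` options, and `partitionColumn` can't be combined with `query`. Next, let's load a bigger dataset from a parquet file.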
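For reference, Spark splits a JDBC read only when `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions` are all given, together with `dbtable` rather than `query`. The sketch below builds such options and illustrates, in simplified form, the kind of `WHERE` clause each partition receives. Both helper names are hypothetical, the bounds `1` and `9` are assumed values for the `id` column, and `partition_predicates` is only an approximation of Spark's internal logic, not its actual implementation.

```python
def partitioned_read_options(base, column, lower, upper, num_partitions):
    """Add the four options Spark needs to split a JDBC read (hypothetical helper)."""
    if "query" in base:
        raise ValueError("`query` cannot be combined with `partitionColumn`; use `dbtable`")
    opts = dict(base)
    opts.update({
        "partitionColumn": column,   # numeric, date or timestamp column
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    })
    return opts


def partition_predicates(column, lower, upper, num_partitions):
    """Simplified illustration of the per-partition WHERE clauses.

    Assumes non-negative integer bounds with lower < upper; real Spark
    versions handle more cases and may compute the stride differently.
    """
    n = min(num_partitions, upper - lower)
    if n <= 1:
        return []  # a single partition reads the whole table
    stride = upper // n - lower // n
    clauses, current = [], lower
    for i in range(n):
        low = f"{column} >= {current}" if i > 0 else None
        current += stride
        high = f"{column} < {current}" if i < n - 1 else None
        if high is None:
            clauses.append(low)                       # last partition: open upper end
        elif low is None:
            clauses.append(f"{high} or {column} is null")  # first partition also takes NULLs
        else:
            clauses.append(f"{low} AND {high}")
    return clauses


base = {"url": "jdbc:mysql://localhost:3306/constance", "dbtable": "employee"}
opts = partitioned_read_options(base, "id", 1, 9, 2)
print(partition_predicates("id", 1, 9, 2))

# With a live session and server (credentials and driver added to `opts`):
# parallelDF = spark.read.format("jdbc").options(**opts).load()
```

Note that `lowerBound` and `upperBound` only decide the partition stride. They do not filter rows, so every row of the table is still read.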


In [40]:
sfPath="/home/pengfei/data_set/sf_fire/sf_fire_snappy.parquet"
df = spark.read.parquet(sfPath)

                                                                                

In [41]:
df.show()

23/09/08 16:44:23 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.




+----------+------+--------------+--------------------+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+-----------------+---------+-----------+----+----------------+--------+-------------+-------+--------------------+--------------+--------------+--------------------------+----------------------+------------------+--------------------+----------------+--------------------+
|CallNumber|UnitID|IncidentNumber|            CallType|  CallDate| WatchDate|        ReceivedDtTm|           EntryDtTm|        DispatchDtTm|        ResponseDtTm|         OnSceneDtTm|       TransportDtTm|        HospitalDtTm|CallFinalDisposition|       AvailableDtTm|             Address|         City|ZipcodeofIncident|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|       CallTypeGroup|NumberofAla

                                                                                

In [42]:
df.rdd.getNumPartitions()

5

## Write dataframe as table

We can also write a DataFrame into a table on the MySQL server. Spark needs the JDBC driver to do so.

There are two methods:
- method 1: df.write.format("jdbc")
- method 2: df.write.jdbc() 

In summary, both approaches can be used to achieve the same goal of writing a DataFrame to a JDBC data source, but the `df.write.jdbc()` method provides a more concise and convenient way to specify the necessary options for the operation. The choice between them depends on your preference and the complexity of your write operation.

### An example with the method 1

Note that with this method every setting is passed through its own `option` line.

In [24]:
tabName = "employee"
sampleDF.write \
  .format("jdbc") \
  .option("driver",driverName) \
  .option("url", mysqlUrl) \
  .option("dbtable", f"{tabName}") \
  .option("user", f"{uid}") \
  .option("password", f"{pwd}") \
  .mode(saveMode="overwrite") \
  .save()

### An example with the method 2

In [17]:

mysqlProperties = {
    "user": f"{uid}",
    "password": f"{pwd}",
    "driver": driverName,
    "rewriteBatchedStatements": "true",
    "batchPerformanceWorkaround": "true",
    "batchsize": "1000"
}
sampleDF.write.jdbc(url=mysqlUrl, table=tabName, mode="overwrite", properties=mysqlProperties)

                                                                                

### Append more rows to an existing table

In [18]:
columns = ["id", "name","age","gender"]
data = [(5, "JamesN",30,"M"), (6, "AnnN",40,"F"),
    (7, "JeffN",41,"M"),(8, "JenniferN",20,"F")]

In [19]:
extraDF = spark.createDataFrame(data,schema=columns)
extraDF.show()

+---+---------+---+------+
| id|     name|age|gender|
+---+---------+---+------+
|  5|   JamesN| 30|     M|
|  6|     AnnN| 40|     F|
|  7|    JeffN| 41|     M|
|  8|JenniferN| 20|     F|
+---+---------+---+------+


In [20]:
# insert extra rows into mysql server
extraDF.write.jdbc(url=mysqlUrl, table=tabName, mode="append", properties=mysqlProperties)

### Optimization of the write operation

Writing a big DataFrame to a MySQL server can take a long time. Here are some tips to improve the write performance:

- **Use Batch Inserts**: Instead of inserting each row individually, send the rows in batches. Spark's JDBC writer already batches inserts; its `batchsize` option (default 1000) sets how many rows go into each batch, and the MySQL Connector/J property `rewriteBatchedStatements=true` lets the driver rewrite a batch into multi-row `INSERT` statements, which can significantly improve write performance.

- **Tune the Number of Partitions**: Make sure the DataFrame is properly partitioned. The number of partitions should match the degree of parallelism available in your cluster. You can control it with the `repartition()` or `coalesce()` methods. For a very large DataFrame, a suitable number of partitions also keeps each individual write operation small.

- **Use JDBC Connection Properties**: Set appropriate JDBC connection properties to optimize the write operation, for example `rewriteBatchedStatements` when using MySQL Connector/J. (`batchPerformanceWorkaround`, which also appears in the examples of this notebook, does not seem to be a documented Connector/J property, so it probably has no effect here.)

- **Choose the Right Write Mode**: Spark allows you to specify different write modes, such as "overwrite," "append," and "ignore." Choose the write mode that fits your use case. For example, if you're appending data, use "append" mode to avoid overwriting existing data.

- **Compression**: Consider enabling compression when writing large amounts of data to MySQL. With MySQL Connector/J, compression of the client/server protocol is turned on with the `useCompression=true` connection property. It reduces the amount of data transferred, which helps mostly when the network is the bottleneck.

- **Partitioned Tables**: If possible, design your database tables to be partitioned based on columns that are frequently filtered or used in queries. This can significantly improve query performance.

- **Indexing:** Properly index your MySQL tables to optimize write and read operations. However, be cautious with indexing, as it can impact write performance during inserts and updates.

- **Optimize MySQL Configuration**: Ensure that your MySQL server is properly configured for write-intensive workloads. Tune MySQL's configuration parameters such as `innodb_buffer_pool_size`, `innodb_flush_log_at_trx_commit`, and others, based on your specific use case.

- **Hardware Scaling**: If you have control over your infrastructure, consider scaling up your MySQL server by increasing CPU, memory, or using faster storage solutions like SSDs to improve write performance.


In the example below:
- we first repartition the DataFrame into 32 partitions; the local session runs with 4 worker threads (`local[4]`), so the partitions are written 4 at a time
- we add the following JDBC properties: "rewriteBatchedStatements": "true", "batchPerformanceWorkaround": "true", "batchsize": "1000"

For a parquet file of 600 MB with 5500519 rows, the write takes about 5 minutes.
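As a rough back-of-the-envelope check of these figures, the arithmetic below estimates the load per partition and the overall throughput. It assumes that the write took exactly 5 minutes and that `repartition(32)` spreads the rows evenly, so the results are estimates rather than measurements.

```python
import math

rows = 5_500_519        # row count reported by df.count() in this notebook
partitions = 32         # df.repartition(32)
batch_size = 1000       # "batchsize" JDBC property
seconds = 5 * 60        # assumption: the write took exactly 5 minutes

rows_per_partition = math.ceil(rows / partitions)
batches_per_partition = math.ceil(rows_per_partition / batch_size)
rows_per_second = rows / seconds

print(rows_per_partition)       # 171892
print(batches_per_partition)    # 172
print(round(rows_per_second))   # 18335
```

So each partition sends roughly 172 batches of 1000 rows, and the server absorbs on the order of 18000 rows per second overall.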

In [25]:
sfPath="/home/pengfei/data_set/sf_fire/sf_fire_snappy.parquet"
df = spark.read.parquet(sfPath)
df.show(5)

23/09/11 15:58:25 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.




+----------+------+--------------+--------------------+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------+------------+--------------------+--------------------+--------------------+-------------+-----------------+---------+-----------+----+----------------+--------+-------------+-------+-------------+--------------+--------+--------------------------+----------------------+------------------+--------------------+-------------+--------------------+
|CallNumber|UnitID|IncidentNumber|            CallType|  CallDate| WatchDate|        ReceivedDtTm|           EntryDtTm|        DispatchDtTm|        ResponseDtTm|         OnSceneDtTm|TransportDtTm|HospitalDtTm|CallFinalDisposition|       AvailableDtTm|             Address|         City|ZipcodeofIncident|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumberofAlarms|UnitType|Unitsequenceincalldispatch|FirePreventio

                                                                                

In [31]:
df.count()

                                                                                

5500519

In [26]:
df = df.repartition(32)
print(df.rdd.getNumPartitions())



23/09/11 15:59:45 WARN TaskMemoryManager: Failed to allocate a page (4194288 bytes), try again.



32


In [30]:
tabName = "sf_fire"
mysqlProperties = {
    "user": f"{uid}",
    "password": f"{pwd}",
    "driver": driverName,
    "rewriteBatchedStatements": "true",
    "batchPerformanceWorkaround": "true",
    "batchsize": "1000"
}
df.write.jdbc(url=mysqlUrl, table=tabName, mode="overwrite", properties=mysqlProperties)



23/09/11 16:03:24 WARN TaskMemoryManager: Failed to allocate a page (4194288 bytes), try again.

                                                                                