# Spark on GPU

Datalab also provides GPUs for data analysis. This notebook shows how to use a GPU in a Spark session. You can also use GPUs for model training (e.g. with TensorFlow or PyTorch).


In [1]:
import os
from pyspark.sql import SparkSession

spark = (SparkSession 
         .builder
         .master("k8s://https://kubernetes.default.svc:443")
         .config("spark.kubernetes.container.image", os.environ['IMAGE_NAME'])
         .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])
         .config("spark.executor.instances", "5")
         .config("spark.executor.memory", "4g")
         .config("spark.kubernetes.driver.pod.name", os.environ['KUBERNETES_POD_NAME'])
         
         # GPU-specific configuration
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.task.resource.gpu.amount", "1")
         .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/examples/src/main/scripts/getGpusResources.sh")
         .config("spark.executor.resource.gpu.vendor", "nvidia.com")
         # spark.rapids.sql.enabled also lets the SQL steps of an ETL job run on the GPUs.
         # Whether this is worthwhile remains to be evaluated; see the video mentioned in the intro.
         .config("spark.rapids.sql.enabled", "true")
         .config("spark.rapids.sql.incompatibleOps.enabled", "true")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
         .config("spark.rapids.force.caller.classloader", "false")

         .getOrCreate()
        )

sc = spark.sparkContext
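
The GPU settings above also limit how many tasks an executor can run at once. The sketch below is a simplified model, not Spark's actual scheduler code: it assumes the number of concurrent tasks per executor is bounded by whichever resource (CPU cores or GPU share) runs out first. The function name `task_slots_per_executor` is hypothetical and not part of any Spark API.

```python
from fractions import Fraction


def task_slots_per_executor(executor_cores, task_cpus, executor_gpus, task_gpus):
    """Estimate how many tasks one executor can run concurrently.

    Simplified model (an assumption, not Spark's implementation):
    the most restrictive resource decides the number of slots.
    """
    cpu_slots = executor_cores // task_cpus
    # Fractions avoid floating-point surprises with values such as 0.1
    gpu_slots = int(Fraction(str(executor_gpus)) / Fraction(str(task_gpus)))
    return min(cpu_slots, gpu_slots)


# Values from this notebook: 1 GPU per executor, 1 GPU per task.
# Even with 4 cores, only one task at a time can hold the GPU.
print(task_slots_per_executor(executor_cores=4, task_cpus=1,
                              executor_gpus=1, task_gpus=1))

# Hypothetical alternative: each task claims a quarter of the GPU.
print(task_slots_per_executor(executor_cores=4, task_cpus=1,
                              executor_gpus=1, task_gpus=0.25))
```

With `spark.executor.resource.gpu.amount` and `spark.task.resource.gpu.amount` both set to 1, as in this notebook, each executor runs one GPU task at a time under this model.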


## Check worker number
Because the number of GPUs in the cluster is limited, you may not get as many workers as you requested. Below are two ways to check how many workers were actually created.

### Check the worker number via Spark UI

To check how many GPU workers have been deployed, use the Spark UI, which shows the status of every worker.

### Check the worker number via kubectl
You can also use the command below to check the number of Spark workers.

In [6]:
! kubectl get pods -l spark-role=executor

NAME                                    READY   STATUS    RESTARTS   AGE
pyspark-shell-f5aecc83465f6d1b-exec-1   1/1     Running   0          20m
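
To compare the number of running executors with the number requested, the table printed by `kubectl get pods` can be parsed in plain Python. This is a minimal sketch: the helper name `count_running_executors` is hypothetical, and it assumes the default tabular output with a `STATUS` column, as shown above.

```python
def count_running_executors(kubectl_output: str) -> int:
    """Count rows whose STATUS column is 'Running' in `kubectl get pods` output."""
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    if not lines:
        return 0
    # Locate the STATUS column from the header row
    status_idx = lines[0].split().index("STATUS")
    return sum(1 for line in lines[1:] if line.split()[status_idx] == "Running")


# Output copied from the cell above
sample_output = """
NAME                                    READY   STATUS    RESTARTS   AGE
pyspark-shell-f5aecc83465f6d1b-exec-1   1/1     Running   0          20m
"""

requested = 5  # value of spark.executor.instances in this notebook
running = count_running_executors(sample_output)
print(f"{running} of {requested} requested executors are running")
```

In this run, only one of the five requested executors is listed, which is consistent with the cluster having a limited number of GPUs.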


In [7]:
! kubectl describe pods pyspark-shell-f5aecc83465f6d1b-exec-1

Name:         pyspark-shell-f5aecc83465f6d1b-exec-1
Namespace:    user-pengfei
Priority:     0
Node:         boss11/192.168.253.171
Start Time:   Fri, 16 Sep 2022 12:56:43 +0000
Labels:       spark-app-selector=spark-application-1663333003497
              spark-exec-id=1
              spark-exec-resourceprofile-id=0
              spark-role=executor
Annotations:  kubernetes.io/psp: default
Status:       Running
IP:           10.233.127.232
IPs:
  IP:           10.233.127.232
Controlled By:  Pod/rapidsai-264860-0
Containers:
  spark-kubernetes-executor:
    Container ID:  containerd://320701628668923c51e18af36cd192db9ccef2ba8f8ad95eb15dd66b48d54be6
    Image:         inseefrlab/rapidsai:cuda11.0-spark3.2.0
    Image ID:      docker.io/inseefrlab/rapidsai@sha256:71e9f007a5bb0e775b07e2a04bd5615457e582cd897a7f118e1bc3c1ab526aed
    Port:          7079/TCP
    Host Port:     0/TCP
    Args:
      executor
    State:          Running
      Started:      Fri, 16 Sep 2022 12:56:50 +0000
    R

## Do some analysis

In [8]:
work_dir="s3a://pengfei"
parquet_file_name="diffusion/data_format/sf_fire/parquet/raw"
data_path=f"{work_dir}/{parquet_file_name}"

In [10]:
df_raw=spark.read.parquet(data_path)

In [11]:
row_nb=df_raw.count()
col_nb=len(df_raw.columns)

print(f"data frame has : {row_nb} rows and {col_nb} columns")

data frame has : 5500519 rows and 34 columns


In [22]:
from pyspark.sql.functions import count, col

In [27]:
sample=df_raw.sample(0.1)

In [28]:
sample.explain()

== Physical Plan ==
*(1) Sample 0.0, 0.1, false, 7648596361908242375
+- GpuColumnarToRow false
   +- GpuFileGpuScan parquet [CallNumber#151,UnitID#152,IncidentNumber#153,CallType#154,CallDate#155,WatchDate#156,ReceivedDtTm#157,EntryDtTm#158,DispatchDtTm#159,ResponseDtTm#160,OnSceneDtTm#161,TransportDtTm#162,HospitalDtTm#163,CallFinalDisposition#164,AvailableDtTm#165,Address#166,City#167,ZipcodeofIncident#168,Battalion#169,StationArea#170,Box#171,OriginalPriority#172,Priority#173,FinalPriority#174,... 10 more fields] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[s3a://pengfei/diffusion/data_format/sf_fire/parquet/raw], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CallNumber:int,UnitID:string,IncidentNumber:int,CallType:string,CallDate:string,WatchDate:...
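
Operators whose names start with `Gpu` in the physical plan are the ones executed by the RAPIDS plugin. As a rough check, the plan text can be scanned for such names. The helper `gpu_operators` below is hypothetical, and the plan string is an abbreviated copy of the output above.

```python
import re


def gpu_operators(plan_text: str) -> list:
    """Return the distinct operator names starting with 'Gpu' found in a plan."""
    return sorted(set(re.findall(r"\bGpu\w+", plan_text)))


# Abbreviated copy of the physical plan printed above
plan = """
== Physical Plan ==
*(1) Sample 0.0, 0.1, false, 7648596361908242375
+- GpuColumnarToRow false
   +- GpuFileGpuScan parquet [CallNumber#151,UnitID#152,IncidentNumber#153]
"""

print(gpu_operators(plan))
```

Here the Parquet scan runs on the GPU, `GpuColumnarToRow` converts the columnar batches back to rows, and the `Sample` step itself runs on the CPU.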




In [30]:
# Count incidents per call type and sort by descending count
top10CallType = (df_raw
                 .groupBy("CallType")
                 .agg(count("IncidentNumber").alias("incident_number"))
                 .orderBy(col("incident_number").desc()))
top10CallType.show(10)

+--------------------+---------------+
|            CallType|incident_number|
+--------------------+---------------+
|    Medical Incident|        3596332|
|      Structure Fire|         681179|
|              Alarms|         599263|
|   Traffic Collision|         224909|
|               Other|          87468|
|Citizen Assist / ...|          82173|
|        Outside Fire|          68491|
|        Water Rescue|          28253|
|        Vehicle Fire|          25512|
|Gas Leak (Natural...|          22961|
+--------------------+---------------+
only showing top 10 rows

