GangliaExport

This lightweight package provides an alternative way to monitor Databricks cluster utilization using the Ganglia web service on the driver node of each cluster. The Ganglia metrics can be exported to any Spark data source format, which makes it easy to analyze cluster usage and avoid paying for costly idle compute.

  • Metrics can be saved to cloud storage in any Spark-supported format, including Delta, and monitored with tools such as Databricks SQL
  • No additional monitoring dependencies, which reduces complexity
  • Ideal for monitoring and optimizing automated job clusters

See this link for info on the Ganglia API service and XML schema.
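
For reference, here is a minimal sketch of what reading that XML might look like, assuming the stock gmond XML endpoint on the driver; the port and element names below come from standard Ganglia and are not necessarily what this package's GangliaClient uses:

import java.net.Socket
import scala.io.Source
import scala.xml.XML

// Stock gmond publishes its metrics XML on TCP port 8649; this port is an
// assumption about the driver setup, not taken from GangliaExport itself.
val socket = new Socket("localhost", 8649)
val raw = Source.fromInputStream(socket.getInputStream).mkString
socket.close()

// Document shape: <GANGLIA_XML><CLUSTER><HOST><METRIC .../>...</HOST></CLUSTER></GANGLIA_XML>
val xml = XML.loadString(raw)
for {
  cluster <- xml \ "CLUSTER"
  host    <- cluster \ "HOST"
  metric  <- host \ "METRIC"
} println(
  s"${(cluster \ "@NAME").text} ${(host \ "@NAME").text} " +
  s"${(metric \ "@NAME").text} = ${(metric \ "@VAL").text} ${(metric \ "@UNITS").text}"
)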

Other options to monitor cluster utilization:

  • Monitor Ganglia while a job is running, or look at the Ganglia snapshots in each cluster manually
  • Install init scripts to send metrics to Datadog or Log Analytics, as noted in Manage Clusters

Note:

  • On UC clusters running DBR 11.3+, collection fails with a Missing Credential Scope error when it runs on another thread via the scheduled executor. (Fix/TODO)
  • Ganglia is not available on DBR 13+; it is replaced by cluster metrics exposed through system tables.

Support

Spark Version   Scala Version   Supported
Spark 2.x       2.11            yes
Spark 3.x       2.12            yes

See the build file for specific Spark versions. Python support is TBD.

Build

Package

sbt +package

Uber jar

sbt +assembly

Test

sbt +test

Usage

Load the GangliaExport assembly library onto your all-purpose cluster, or add it as a dependent library in your automated job.

Collect Metrics

Add the following code to your Spark application or notebook:

import com.databricks.gangliaexport.GangliaExport

// Start a periodic export of Ganglia metrics: Delta format, no extra writer
// options, target path, and the export interval (30 seconds here).
val f = GangliaExport.exportMetrics(spark, "delta", Map.empty[String, String], "/tmp/howard.wong@databricks.com/gangliametricstest", 30)
//...

f.cancel(true) // stop the scheduled export
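
In an automated job, one reasonable pattern is to cancel the exporter in a finally block so collection stops even if the job fails (runJobLogic below is a hypothetical placeholder for your actual job code):

import com.databricks.gangliaexport.GangliaExport

val f = GangliaExport.exportMetrics(spark, "delta", Map.empty[String, String], "/tmp/howard.wong@databricks.com/gangliametricstest", 30)
try {
  runJobLogic() // hypothetical: your job's actual ETL/processing code
} finally {
  f.cancel(true) // always stop the scheduled export
}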

Or collect the metrics as a DataFrame and save them yourself:

// GangliaClient is assumed to live in the same package as GangliaExport
import com.databricks.gangliaexport.{GangliaClient, GangliaExport}
val gc = new GangliaClient()
val df = GangliaExport.collectMetrics(spark, gc) // one snapshot of the current metrics
df.write.format("csv").partitionBy("reportingDate").mode("append").save("/tmp/testexport")
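
This single-snapshot approach is useful when you want to control the collection schedule yourself, for example from your own loop or scheduler, instead of relying on the built-in periodic export.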

Read Metrics

val metrics = spark.read.format("delta").load("/tmp/howard.wong@databricks.com/gangliametricstest")

Schema:

root
 |-- clusterName: string (nullable = true)
 |-- clusterId: string (nullable = true)
 |-- hostName: string (nullable = true)
 |-- hostIp: string (nullable = true)
 |-- reportingTime: long (nullable = true)
 |-- metricName: string (nullable = true)
 |-- metricType: string (nullable = true)
 |-- metricUnits: string (nullable = true)
 |-- metricVal: double (nullable = true)
 |-- reportingDate: date (nullable = true)

Use a sample notebook to try this.
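
The sample queries below use SQL over a temporary view, but the same analysis can be done directly with the DataFrame API against this schema. For example, a rough sketch of average CPU utilization per cluster and day, derived from cpu_idle in the same way as the CPU Load query below:

import org.apache.spark.sql.functions._
import spark.implicits._

val cpuUsage = metrics
  .filter($"metricName" === "cpu_idle")
  .withColumn("cpuPercent", lit(100) - $"metricVal") // busy % = 100 - idle %
  .groupBy($"clusterName", $"reportingDate")
  .agg(avg($"cpuPercent").as("avgCpuPercent"))
  .orderBy($"reportingDate")

cpuUsage.show()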

Sample Queries

Create a temporary view

val metrics = spark.read.format("delta").load("/tmp/howard.wong@databricks.com/gangliametricstest")
metrics.createOrReplaceTempView("metrics")

CPU Load

%sql
select
  from_unixtime(reportingTime) as reportingTime,
  metricName,
  (100 - metricVal) as cpupercent
from
  metrics
where
  metricName = "cpu_idle"
  and clusterName = 'exporttest'
order by
  reportingTime

CPU Usage - Chart using series grouping by metricName and avg(metricVal)

%sql
select
  from_unixtime(reportingTime) as reportingTime,
  clusterName,
  metricName,
  metricVal
from
  metrics
where
  metricName like 'cpu%'
  and metricUnits = '%'
  and clusterName = 'exporttest'
  and reportingDate BETWEEN DATE '2020-01-01'
  AND DATE '2021-01-31'
order by
  reportingTime,
  clusterName,
  metricName

CPU Usage - Chart using pivot (metricVal rows to columns)

%sql
select
  *
from
  (
    select
      from_unixtime(reportingTime) as reportingTime,
      clusterName,
      metricName,
      metricVal
    from
      metrics
    where
      metricName like 'cpu%'
      and metricUnits = '%'
      and clusterName = 'exporttest'
      and reportingDate BETWEEN DATE '2020-01-01' AND DATE '2021-01-31'
  ) pivot (
    avg(metricVal) for metricName in (
      'cpu_idle' cpu_idle,
      'cpu_system' cpu_system,
      'cpu_user' cpu_user,
      'cpu_nice' cpu_nice
    )
  )
order by
  reportingTime

Memory Usage - Chart using pivot

%sql
-- Note: the memory metric names below assume Ganglia's standard memory
-- metrics (reported in KB); adjust them to the names present in your data.
select
  *
from
  (
    select
      from_unixtime(reportingTime) as reportingTime,
      clusterName,
      metricName,
      metricVal
    from
      metrics
    where
      metricName like 'mem%'
      and clusterName = 'exporttest'
      and reportingDate BETWEEN DATE '2020-01-01' AND DATE '2021-01-31'
  ) pivot (
    avg(metricVal) for metricName in (
      'mem_free' mem_free,
      'mem_cached' mem_cached,
      'mem_buffers' mem_buffers,
      'mem_shared' mem_shared
    )
  )
order by
  reportingTime
