Link to dataset: https://archive.ics.uci.edu/ml/datasets/SGEMM+GPU+kernel+performance#

Import libraries:

In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Load dataset into Spark dataframe:

In [2]:
df = spark.read.csv("sgemm_product.csv", header=True, inferSchema=True)

Examine the first 10 rows of the dataset:

In [3]:
df.show(10)

+---+---+---+-----+-----+-----+-----+---+---+---+----+----+---+---+---------+---------+---------+---------+
|MWG|NWG|KWG|MDIMC|NDIMC|MDIMA|NDIMB|KWI|VWM|VWN|STRM|STRN| SA| SB|Run1 (ms)|Run2 (ms)|Run3 (ms)|Run4 (ms)|
+---+---+---+-----+-----+-----+-----+---+---+---+----+----+---+---+---------+---------+---------+---------+
| 16| 16| 16|    8|    8|    8|    8|  2|  1|  1|   0|   0|  0|  0|   115.26|   115.87|   118.55|    115.8|
| 16| 16| 16|    8|    8|    8|    8|  2|  1|  1|   0|   0|  0|  1|    78.13|    78.25|    79.25|    79.19|
| 16| 16| 16|    8|    8|    8|    8|  2|  1|  1|   0|   0|  1|  0|    79.84|    80.69|    80.76|    80.97|
| 16| 16| 16|    8|    8|    8|    8|  2|  1|  1|   0|   0|  1|  1|    84.32|     89.9|    86.75|    85.58|
| 16| 16| 16|    8|    8|    8|    8|  2|  1|  1|   0|   1|  0|  0|   115.13|   121.98|   122.73|   114.81|
| 16| 16| 16|    8|    8|    8|    8|  2|  1|  1|   0|   1|  0|  1|     81.1|    82.41|    87.01|    82.14|
| 16| 16| 16|    8|    8|   

The predictors are all integers, and a few of them are binary flags. The four Run columns are repeated timings of the same configuration, so we pick one of them, Run1 (ms), as the target variable and drop the Run2, Run3, and Run4 columns.

In [4]:
df = df.drop("Run2 (ms)", "Run3 (ms)", "Run4 (ms)")
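
As a rough check on treating the four runs as interchangeable, the sketch below measures how much the runs disagree within each of the six complete rows of the preview above. The values are copied from that df.show(10) output; the helper name relative_spread is made up for illustration, and six rows say nothing certain about the full dataset.

```python
# Run1..Run4 (ms) for the six complete rows of the df.show(10) preview.
preview_runs = [
    (115.26, 115.87, 118.55, 115.80),
    (78.13, 78.25, 79.25, 79.19),
    (79.84, 80.69, 80.76, 80.97),
    (84.32, 89.90, 86.75, 85.58),
    (115.13, 121.98, 122.73, 114.81),
    (81.10, 82.41, 87.01, 82.14),
]


def relative_spread(runs):
    """Return (max - min) / mean for one row of timings (hypothetical helper)."""
    mean = sum(runs) / len(runs)
    return (max(runs) - min(runs)) / mean


spreads = [relative_spread(row) for row in preview_runs]
for row, spread in zip(preview_runs, spreads):
    print(f"Run1={row[0]:7.2f} ms  spread across runs = {spread:.1%}")
```

In these six rows the four timings stay within 8% of their mean, which is consistent with using any single run as the target.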

Check the schema and the data types of the variables, then convert every column that is not already a double to double.

In [5]:
df.printSchema()

root
 |-- MWG: integer (nullable = true)
 |-- NWG: integer (nullable = true)
 |-- KWG: integer (nullable = true)
 |-- MDIMC: integer (nullable = true)
 |-- NDIMC: integer (nullable = true)
 |-- MDIMA: integer (nullable = true)
 |-- NDIMB: integer (nullable = true)
 |-- KWI: integer (nullable = true)
 |-- VWM: integer (nullable = true)
 |-- VWN: integer (nullable = true)
 |-- STRM: integer (nullable = true)
 |-- STRN: integer (nullable = true)
 |-- SA: integer (nullable = true)
 |-- SB: integer (nullable = true)
 |-- Run1 (ms): double (nullable = true)



In [6]:
from pyspark.sql.functions import col
df = df.select([col(c).cast("double") for c in df.columns])
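
The cell above casts every column, including the one that is already a double. A minimal sketch of the "only if not already double" variant follows. It assumes the (name, type) pairs come from df.dtypes, which reports Spark types as short strings such as int and double; columns_needing_cast is a made-up helper name, and the Spark usage is left in comments because it needs the live df.

```python
def columns_needing_cast(dtypes, target="double"):
    """Return the names of columns whose type string differs from target.

    dtypes is a list of (column_name, type_string) pairs, the shape that
    df.dtypes returns in PySpark. Hypothetical helper for illustration.
    """
    return [name for name, dtype in dtypes if dtype != target]


# Stand-in for df.dtypes, abbreviated from the schema printed above.
example_dtypes = [
    ("MWG", "int"),
    ("NWG", "int"),
    ("SA", "int"),
    ("SB", "int"),
    ("Run1 (ms)", "double"),
]

to_cast = columns_needing_cast(example_dtypes)
print(to_cast)

# With the live DataFrame, the same list could drive a selective cast:
#   from pyspark.sql.functions import col
#   pending = set(columns_needing_cast(df.dtypes))
#   df = df.select([col(c).cast("double").alias(c) if c in pending else col(c)
#                   for c in df.columns])
```

For this dataset the result is the same as casting everything, since only Run1 (ms) is a double to begin with.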

Since the target variable is continuous, we fit a gradient-boosted tree regression model. We train it on 80% of the data and report the RMSE on the held-out 20%.
Sources cited:
https://docs.databricks.com/_extras/notebooks/source/gbt-regression.html
https://www.datatechnotes.com/2021/05/mllib-gradient-boosted-tree-regression.html

In [14]:
from pyspark.ml.feature import VectorAssembler

# Remove the target column from the input feature set.
features = df.columns
features.remove('Run1 (ms)')

# PySpark ML expects all predictor columns packed into a single vector column.
vectorAssembler = VectorAssembler(inputCols=features, outputCol="features")

va_df = vectorAssembler.transform(df)

#Split data into test and train: 80-20
train, test = va_df.randomSplit([0.8, 0.2], seed = 0)

# Set up and fit the gradient-boosted tree regressor.
from pyspark.ml.regression import GBTRegressor

gbt = GBTRegressor(featuresCol='features', labelCol="Run1 (ms)", maxIter=10)
gbt_model = gbt.fit(train)

# Predict on the held-out test set.
mdata = gbt_model.transform(test)

from pyspark.ml.evaluation import RegressionEvaluator

# Keep the evaluator and the resulting score in separate variables.
evaluator = RegressionEvaluator(labelCol="Run1 (ms)", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(mdata)

print("RMSE: ", rmse)

RMSE:  127.43825270739755
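
To make the metric concrete, here is a small pure-Python sketch of what RMSE computes, together with a constant-mean baseline that gives a reference point for judging an RMSE value. The six labels are the Run1 (ms) values from the preview near the top; the functions root_mean_squared_error and mean_baseline_rmse are illustrative names, not part of PySpark, and a real comparison would need the full dataset.

```python
import math


def root_mean_squared_error(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def mean_baseline_rmse(labels):
    """RMSE of a model that always predicts the mean of the labels."""
    mean = sum(labels) / len(labels)
    return root_mean_squared_error(labels, [mean] * len(labels))


# Run1 (ms) values from the six complete rows of the preview.
run1_preview = [115.26, 78.13, 79.84, 84.32, 115.13, 81.10]
print("baseline RMSE on preview rows:", mean_baseline_rmse(run1_preview))
```

The constant-mean baseline equals the population standard deviation of the labels, so on the full data it could be obtained from the Run1 (ms) column (for example with stddev_pop from pyspark.sql.functions) and compared with the model's RMSE.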


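Conclusion: The model reaches a test RMSE of about 127.4 ms with 10 boosting iterations and no tuning. Whether this counts as low depends on the spread of Run1 (ms), which we did not examine here, and more could have been done to clean and preprocess the data.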