## Running Linear Regression Using Spark ML on GCP Dataproc

Run a linear regression using Apache Spark ML.

The following PySpark (Spark Python API) code takes these actions:

  * Load a previously created linear regression input table from
    BigQuery into our Cloud Dataproc Spark cluster as a Spark DataFrame
  * Filter out rows containing NULL values with Spark SQL
  * Vectorize the features on which the model will be trained by
    mapping over the DataFrame's underlying RDD (Resilient Distributed
    Dataset) and converting the result back into a DataFrame
  * Compute a linear regression using Spark ML

In [1]:
# Import libraries

from __future__ import print_function
from pyspark.context import SparkContext
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
from pyspark.sql.session import SparkSession
# The imports, above, allow us to access SparkML features specific to linear
# regression as well as the Vectors types


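Next, we define a function that collects the features of interest
(`mother_age`, `father_age`, `gestation_weeks`, `weight_gain_pounds`, and
`apgar_5min`) into a vector and packages that vector in a tuple with the
label (`weight_pounds`) for that row. The same cell creates the Spark
session and loads the BigQuery connector JAR.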

In [2]:
def vector_from_inputs(r):
    return (r["weight_pounds"], Vectors.dense(float(r["mother_age"]),
                                            float(r["father_age"]),
                                            float(r["gestation_weeks"]),
                                            float(r["weight_gain_pounds"]),
                                            float(r["apgar_5min"])))

# sc = SparkContext()
# spark = SparkSession(sc)

spark = (SparkSession.builder
         .config('spark.jars',
                 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar')
         .getOrCreate())

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/04/06 16:20:06 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
22/04/06 16:20:06 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
22/04/06 16:20:06 INFO org.apache.spark.SparkEnv: Registering BlockManagerMasterHeartbeat
22/04/06 16:20:07 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator


### Retrieve data from BigQuery

In [3]:
natality_data = spark.read.format("bigquery").option(
    "table", "natality_regression_lcmhng.regression_input").load()
# Create a view so that Spark SQL queries can be run against the data.
natality_data.createOrReplaceTempView("natality")

### Ensure we do not have null values in the retrieved data

In [4]:
# As a precaution, use Spark SQL to drop rows with a NULL in the label or in
# any feature that vector_from_inputs reads.
sql_query = """
SELECT *
FROM natality
WHERE weight_pounds IS NOT NULL
AND mother_age IS NOT NULL
AND father_age IS NOT NULL
AND gestation_weeks IS NOT NULL
AND weight_gain_pounds IS NOT NULL
AND apgar_5min IS NOT NULL
"""
clean_data = spark.sql(sql_query)
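The filter matters because `vector_from_inputs` passes every field through `float()`, and `float(None)` raises a `TypeError`. The sketch below illustrates this in plain Python, with a dict standing in for a Spark `Row` and a list standing in for `Vectors.dense`. The helper `to_label_and_features` and the sample values are hypothetical and not part of the notebook.

```python
# Plain-Python stand-in for vector_from_inputs: a dict replaces the Spark Row
# and a list replaces Vectors.dense. Names and sample values are hypothetical.
FEATURE_COLUMNS = ["mother_age", "father_age", "gestation_weeks",
                   "weight_gain_pounds", "apgar_5min"]


def to_label_and_features(row):
    """Return (label, features) for one row, mirroring vector_from_inputs."""
    return (row["weight_pounds"], [float(row[c]) for c in FEATURE_COLUMNS])


complete_row = {"weight_pounds": 7.5, "mother_age": 30, "father_age": 32,
                "gestation_weeks": 39, "weight_gain_pounds": 30,
                "apgar_5min": 9}
# Same row, but with one feature missing (NULL in SQL becomes None in Python).
row_with_null = dict(complete_row, father_age=None)

label, features = to_label_and_features(complete_row)
print(label, features)

try:
    to_label_and_features(row_with_null)
    null_row_failed = False
except TypeError:
    # float(None) raises TypeError; inside Spark this would fail the map task.
    null_row_failed = True
print("NULL row raises TypeError:", null_row_failed)
```

In the real job the same exception would surface inside the `rdd.map` call, so dropping such rows beforehand keeps the failure from happening on the executors.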

### Train the model and get results

In [5]:
# Create an input DataFrame for Spark ML using the above function.
training_data = clean_data.rdd.map(vector_from_inputs).toDF(["label",
                                                             "features"])
training_data.cache()

DataFrame[label: double, features: vector]

In [6]:
# Construct a new LinearRegression object and fit the training data.
lr = LinearRegression(maxIter=5, regParam=0.2, solver="normal")
model = lr.fit(training_data)
# Print the model summary.
print("Coefficients:" + str(model.coefficients))
print("Intercept:" + str(model.intercept))
print("R^2:" + str(model.summary.r2))
model.summary.residuals.show()

22/04/06 16:31:33 WARN com.github.fommil.netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
22/04/06 16:31:33 WARN com.github.fommil.netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
22/04/06 16:31:33 WARN com.github.fommil.netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
22/04/06 16:31:33 WARN com.github.fommil.netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK

Coefficients:[0.0166657454631094,-0.0029675198396670547,0.23571439297525,0.002130020702160164,-0.0004857725171350398]
Intercept:-2.2613033091401618
R^2:0.2952005790467246
+--------------------+
|           residuals|
+--------------------+
|  1.5200177066924123|
| 0.43406716816449453|
|  1.1069502679100047|
|   -2.39374606430762|
| -0.3422831562811899|
| -1.7954576684049028|
|  0.4321400465038314|
| -2.1437081115992154|
| -0.5338981682655755|
|-0.47571385742194483|
|  0.6014382599197452|
|   0.610696485413726|
|  0.8446255763004125|
|   -2.51233134747541|
| 0.03278077088410214|
| 0.15503577486861353|
| -0.3360047199767555|
|   2.098502228284863|
|-0.25160131787674267|
|  1.1089628643295146|
+--------------------+
only showing top 20 rows
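To make the output concrete, the sketch below applies the printed coefficients and intercept to a single row in plain Python. It assumes the coefficients follow the feature order used in `vector_from_inputs`. The sample row and its observed weight are hypothetical, and `predict` and `r_squared` are illustrative helpers, not Spark APIs. The residual is taken as label minus prediction and R^2 as 1 - SS_res / SS_tot, the usual definitions for a model fitted with an intercept.

```python
# Coefficients and intercept copied from the model output above, assumed to be
# in the order: mother_age, father_age, gestation_weeks, weight_gain_pounds,
# apgar_5min.
COEFFICIENTS = [0.0166657454631094, -0.0029675198396670547,
                0.23571439297525, 0.002130020702160164,
                -0.0004857725171350398]
INTERCEPT = -2.2613033091401618


def predict(features):
    """Linear prediction: intercept plus the dot product with the features."""
    return INTERCEPT + sum(c * x for c, x in zip(COEFFICIENTS, features))


def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Hypothetical row: mother 30, father 32, 39 weeks, 30 lb gain, Apgar 9.
sample_features = [30.0, 32.0, 39.0, 30.0, 9.0]
observed_weight = 7.5  # hypothetical label, in pounds

predicted_weight = predict(sample_features)
residual = observed_weight - predicted_weight  # label minus prediction

print("Predicted weight (lb):", predicted_weight)
print("Residual (lb):", residual)
```

For this hypothetical row the prediction comes out at roughly 7.4 pounds. An R^2 of about 0.295 means the five features account for a bit under 30% of the variance in birth weight in the training data.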

