### Practical implementation
Now, let’s turn to code to see how a regression system is
implemented in practice.


### About the dataset
The California Housing dataset is a popular dataset that’s used in
data science and machine learning to demonstrate algorithms and
techniques for regression analysis. This dataset typically serves as a
benchmark for predictive modeling tasks, where the goal is to
forecast median house values in Californian districts based on
various features. Built from 1990 U.S. census data, it was originally
compiled for a study of California housing values and has since
become a staple example in machine learning communities,
particularly for those starting with data science.

The California Housing dataset is widely used for teaching and
testing regression models. Regression is a type of predictive
modeling technique that’s used to understand relationships between
independent variables (features) and a continuous dependent
variable (target). In the context of this dataset, regression models
are built to predict the median house value based on district-level
characteristics.
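
To make the idea of relating a feature to a continuous target concrete, here is a minimal sketch of ordinary least squares with a single feature, written in plain Python. The data points and the function name fit_simple_linear_regression are hypothetical and chosen only for illustration; they are not taken from the California Housing dataset or from Spark MLlib.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) that minimize squared error for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The least-squares slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

The models used later in this section follow the same principle, but with eight features instead of one.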


In [1]:
import warnings
warnings.filterwarnings("ignore")

### Loading the California Housing data
The following lines of code import the required classes, create a
SparkSession, and load the dataset with scikit-learn. The data is
converted into a pandas DataFrame and then into a Spark
DataFrame in the cell after this one:

In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression, GeneralizedLinearRegression, DecisionTreeRegressor, RandomForestRegressor, GBTRegressor
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from sklearn.datasets import fetch_california_housing

# Create a SparkSession
spark = SparkSession.builder \
    .appName("RegressionPipelineExample") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

housing = fetch_california_housing()

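import pandas as pd
# Compatibility shim: pandas 2.0 removed DataFrame.iteritems, which some older
# PySpark versions still call when converting a pandas DataFrame.
pd.DataFrame.iteritems = pd.DataFrame.items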


The following code snippet converts the data into a pandas
DataFrame, creates a Spark DataFrame from it, and displays the
first five rows:


In [3]:
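# Build a pandas DataFrame from the feature matrix and add the target as "label"
df = pd.DataFrame(data=housing.data, columns=housing.feature_names)
df['label'] = housing.target
# Convert to a Spark DataFrame and preview the first five rows
df_data = spark.createDataFrame(df)
df_data.show(5)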

+------+--------+------------------+------------------+----------+------------------+--------+---------+-----+
|MedInc|HouseAge|          AveRooms|         AveBedrms|Population|          AveOccup|Latitude|Longitude|label|
+------+--------+------------------+------------------+----------+------------------+--------+---------+-----+
|8.3252|    41.0| 6.984126984126984|1.0238095238095237|     322.0|2.5555555555555554|   37.88|  -122.23|4.526|
|8.3014|    21.0| 6.238137082601054|0.9718804920913884|    2401.0| 2.109841827768014|   37.86|  -122.22|3.585|
|7.2574|    52.0| 8.288135593220339| 1.073446327683616|     496.0|2.8022598870056497|   37.85|  -122.24|3.521|
|5.6431|    52.0|5.8173515981735155|1.0730593607305936|     558.0| 2.547945205479452|   37.85|  -122.25|3.413|
|3.8462|    52.0| 6.281853281853282|1.0810810810810811|     565.0|2.1814671814671813|   37.85|  -122.25|3.422|
+------+--------+------------------+------------------+----------+------------------+--------+---------+-----+

The following code snippet prepares the dataset for machine
learning. It uses VectorAssembler to combine the individual feature
columns into a single vector column, which is the input format that
Spark MLlib’s algorithms expect:

In [4]:
# Prepare the features column
feature_cols = df_data.columns[:-1]  # Assuming the last column is the label
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_data = assembler.transform(df_data)
df_data.show(5,100)

+------+--------+------------------+------------------+----------+------------------+--------+---------+-----+-----------------------------------------------------------------------------------------+
|MedInc|HouseAge|          AveRooms|         AveBedrms|Population|          AveOccup|Latitude|Longitude|label|                                                                                 features|
+------+--------+------------------+------------------+----------+------------------+--------+---------+-----+-----------------------------------------------------------------------------------------+
|8.3252|    41.0| 6.984126984126984|1.0238095238095237|     322.0|2.5555555555555554|   37.88|  -122.23|4.526|[8.3252,41.0,6.984126984126984,1.0238095238095237,322.0,2.5555555555555554,37.88,-122.23]|
|8.3014|    21.0| 6.238137082601054|0.9718804920913884|    2401.0| 2.109841827768014|   37.86|  -122.22|3.585|[8.3014,21.0,6.238137082601054,0.9718804920913884,2401.0,2.109841827768014,37.86,-122.
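
Conceptually, assembling features means collecting the values of the chosen columns, in order, into one vector per row. The following plain-Python sketch illustrates the idea on a single row; the helper assemble_features is hypothetical and is not part of Spark, and only three of the eight feature columns are used to keep the example short.

```python
def assemble_features(row, input_cols):
    """Collect the values of input_cols, in order, into one list (the feature vector)."""
    return [float(row[col]) for col in input_cols]


# A few values from the first row of the dataset shown earlier, as a plain dict.
row = {"MedInc": 8.3252, "HouseAge": 41.0, "Population": 322.0, "label": 4.526}

# Every column except the label is treated as a feature.
feature_cols = [col for col in row if col != "label"]

features = assemble_features(row, feature_cols)
print(features)  # [8.3252, 41.0, 322.0]
```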

Feature scaling and normalization are important preprocessing
steps in the data preparation phase of machine learning and data
modeling. These techniques adjust the scale or distribution of
features (variables) in your data, improving model performance,
ensuring faster convergence, and providing balanced regularization.
The following code snippet uses StandardScaler, a feature
transformation tool used for scaling and normalizing features in
data preprocessing for machine learning models:

In [5]:
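# Define a scaler that standardizes the assembled "features" column
# (zero mean, unit standard deviation). It is only defined here; it takes
# effect only when added as a stage to a pipeline.
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True, withMean=True)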
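
As a rough illustration of what standardization does to one feature, the sketch below subtracts the mean and divides by the standard deviation in plain Python. The function standardize and its parameters with_mean and with_std are hypothetical names that mirror the withMean and withStd options. The sketch assumes the sample standard deviation (dividing by n - 1) and a feature that is not constant.

```python
import statistics


def standardize(values, with_mean=True, with_std=True):
    """Center values on their mean and scale them to unit sample standard deviation."""
    mean = statistics.mean(values) if with_mean else 0.0
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    std = statistics.stdev(values) if with_std else 1.0
    return [(v - mean) / std for v in values]


# HouseAge values from the first five rows shown earlier.
house_age = [41.0, 21.0, 52.0, 52.0, 52.0]

scaled = standardize(house_age)
print(scaled)
print(statistics.mean(scaled), statistics.stdev(scaled))
```

After standardization, the values have a mean of approximately 0 and a standard deviation of approximately 1, so features measured on very different scales (such as Population and AveBedrms) become comparable.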

Splitting data is a fundamental practice in machine learning for
validating models. Holding out a test set helps detect overfitting,
where a model performs well on the training data but poorly on
new, unseen data:


In [6]:
# Split the dataset into training and testing sets
(trainingData, testData) = df_data.randomSplit([0.8, 0.2])
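
The sketch below illustrates one way a weighted random split can work: each row is assigned to the training set with probability 0.8, so the resulting sizes are approximate rather than exact. The function random_split is a hypothetical plain-Python helper, not Spark’s implementation. It also shows why fixing a seed matters for reproducibility; PySpark’s randomSplit accepts an optional seed argument for the same reason.

```python
import random


def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to the training set with probability train_fraction."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))  # stand-in for 1,000 dataset rows
train_rows, test_rows = random_split(rows)

# Sizes are close to 800 and 200, but not exactly, because assignment is random.
print(len(train_rows), len(test_rows))
```

Without a seed, every run produces a different split, so metrics such as the RMSE values reported later will vary from run to run.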


With the data split, the next step is to initialize the models.
The following code snippet initializes five different regression
models from Apache Spark’s MLlib: linear regression, generalized
linear regression, decision tree, random forest, and
gradient-boosted trees. Each model is configured with a features
column named features and a label column named label. In Spark
MLlib, the featuresCol parameter specifies the input column that
contains feature vectors, and the labelCol parameter specifies the
column containing the target variable:


In [7]:
# Initialize regression models
lr = LinearRegression(featuresCol="features", labelCol="label")
glr = GeneralizedLinearRegression(featuresCol="features", labelCol="label")
dt = DecisionTreeRegressor(featuresCol="features", labelCol="label")
rf = RandomForestRegressor(featuresCol="features", labelCol="label")
gbt = GBTRegressor(featuresCol="features", labelCol="label")

Pipelines in Spark MLlib are powerful tools for building machine
learning workflows. They allow you to chain multiple
transformation and model training stages, ensuring that all the steps
are executed in the correct order. This makes it possible to bundle
data preprocessing, feature transformation, and model training
into a single, reusable workflow.
Pipelines also help in maintaining clean code and facilitate the
deployment and reuse of machine learning models.
The following code snippet creates five separate pipelines using
Apache Spark’s Pipeline API. Each pipeline contains a single stage,
one of the regression models initialized previously:

In [8]:
# Define a pipeline for each regression algorithm
pipeline_lr = Pipeline(stages=[lr])
pipeline_glr = Pipeline(stages=[glr])
pipeline_dt = Pipeline(stages=[dt])
pipeline_rf = Pipeline(stages=[rf])
pipeline_gbt = Pipeline(stages=[gbt])
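
To show how chaining stages works in principle, here is a minimal plain-Python sketch of the pipeline idea. The classes MeanCenterer, Doubler, and MiniPipeline are hypothetical and greatly simplified; in particular, Spark’s Pipeline.fit returns a separate fitted model object, whereas this sketch simply stores the learned state on the stages themselves.

```python
class MeanCenterer:
    """Toy stage: learns the mean in fit() and subtracts it in transform()."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean for v in values]


class Doubler:
    """Toy stateless stage: multiplies every value by 2."""

    def fit(self, values):
        return self

    def transform(self, values):
        return [2 * v for v in values]


class MiniPipeline:
    """Runs stages in order; each stage sees the output of the previous one."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        current = values
        for stage in self.stages:
            stage.fit(current)
            current = stage.transform(current)
        return self

    def transform(self, values):
        current = values
        for stage in self.stages:
            current = stage.transform(current)
        return current


pipeline = MiniPipeline([MeanCenterer(), Doubler()])
pipeline.fit([1.0, 2.0, 3.0])            # the mean (2.0) is learned from training data
result = pipeline.transform([4.0, 6.0])  # the same mean is reused on new data
print(result)  # [4.0, 8.0]
```

The key property is that anything learned during fitting comes only from the training data and is then applied unchanged to new data.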

The following code snippet trains the machine learning models.
Each line of code fits one pipeline to the training data and returns
a fitted model that can be used for predictions:

In [9]:
# Fit the pipelines
model_lr = pipeline_lr.fit(trainingData)
model_glr = pipeline_glr.fit(trainingData)
model_dt = pipeline_dt.fit(trainingData)
model_rf = pipeline_rf.fit(trainingData)
model_gbt = pipeline_gbt.fit(trainingData)


The following code snippet takes each model that’s been trained on
the training data and applies it to the unseen test data so that its
performance can be evaluated. The transform method in Spark
MLlib is used for this purpose. It adds a column (named prediction
by default) to the input DataFrame that contains the predicted
value for each row:

In [10]:
# Make predictions
predictions_lr = model_lr.transform(testData)
predictions_glr = model_glr.transform(testData)
predictions_dt = model_dt.transform(testData)
predictions_rf = model_rf.transform(testData)
predictions_gbt = model_gbt.transform(testData)

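The following code snippet evaluates each regression model on the
test dataset using the root mean squared error (RMSE) metric.
RMSE is expressed in the same units as the label, which in
scikit-learn’s version of this dataset is the median house value in
hundreds of thousands of dollars, and lower values indicate a
better fit: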


In [11]:
# Evaluate the models
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")

rmse_lr = evaluator.evaluate(predictions_lr)
rmse_glr = evaluator.evaluate(predictions_glr)
rmse_dt = evaluator.evaluate(predictions_dt)
rmse_rf = evaluator.evaluate(predictions_rf)
rmse_gbt = evaluator.evaluate(predictions_gbt)

print("Linear Regression RMSE:", rmse_lr)
print("Generalized Linear Regression RMSE:", rmse_glr)
print("Decision Tree Regression RMSE:", rmse_dt)
print("Random Forest Regression RMSE:", rmse_rf)
print("Gradient Boosted Tree Regression RMSE:", rmse_gbt)

Linear Regression RMSE: 1.1394418865304807
Generalized Linear Regression RMSE: 1.1394418865298508
Decision Tree Regression RMSE: 0.7237963286049708
Random Forest Regression RMSE: 0.6875829969719259
Gradient Boosted Tree Regression RMSE: 0.5676564538704333
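
For reference, RMSE is the square root of the average squared difference between labels and predictions. The sketch below computes it in plain Python; the function rmse and the sample values are hypothetical and serve only to illustrate the formula, not to reproduce the results above.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between true labels and predicted values."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: the errors are 1, 0, and -2, so the squared errors are 1, 0, and 4.
labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]

value = rmse(labels, predictions)
print(value)  # square root of 5/3, roughly 1.29
```

Because the errors are squared before averaging, a few large errors raise RMSE more than many small ones.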


In [12]:
predictions_lr.show(5,60)

+------+--------+------------------+------------------+----------+------------------+--------+---------+-----+------------------------------------------------------------+------------------+
|MedInc|HouseAge|          AveRooms|         AveBedrms|Population|          AveOccup|Latitude|Longitude|label|                                                    features|        prediction|
+------+--------+------------------+------------------+----------+------------------+--------+---------+-----+------------------------------------------------------------+------------------+
|0.4999|    46.0|1.7142857142857142|0.5714285714285714|      18.0|2.5714285714285716|   37.81|  -122.29|0.675|[0.4999,46.0,1.7142857142857142,0.5714285714285714,18.0,2...| 1.062330373214671|
|0.8172|    52.0| 6.102459016393443|1.3729508196721312|     728.0|2.9836065573770494|   37.82|  -122.28|0.853|[0.8172,52.0,6.102459016393443,1.3729508196721312,728.0,2...|1.3010006477491558|
|0.8668|    52.0|2.4431818181818183|0.9886363

### Summary
Regression analysis plays a crucial role in supervised learning,
enabling us to understand relationships between variables and
make informed predictions.