# 03 Linear Regression

## 3.1 The `MLlib` Library

PySpark comes equipped with `MLlib`, Spark's own machine learning library (the documentation can be found here: <a href="https://spark.apache.org/docs/latest/ml-guide.html">MLlib: Main Guide</a>). As a quick overview of how the library is designed, here are the key principles:

1) A `Transformer` is an abstraction for something that transforms a dataframe. The `Transformer` concept is an umbrella term for 2 kinds of transformers: feature transformers and learned models. A feature transformer will generate new features based on existing features, while a learned model will generate a prediction column based on some feature columns.

2) An `Estimator` is an abstraction for the concept of a learning algorithm (or any algorithm that fits/trains on the data). An `Estimator` implements an abstract method called `fit()` which takes in a dataframe and outputs a `Transformer` (the fitted model). In Java vernacular, the `LinearRegression` class implements the `Estimator` class and produces a `LinearRegressionModel` object.

3) A `Pipeline` is a sequential collection of `Transformer` objects and `Estimator` objects. The idea is that the dataframe undergoes processing by various `Transformer` objects, then is fed into an `Estimator` object to produce a fitted model; the output of fitting a `Pipeline` is therefore itself a fitted model (hence a `Transformer` object). A `Pipeline` packages all the data processing and model fitting into a single object. This is useful for data that requires very involved processing steps, such as text data and images.

## 3.1 The Sample Data

In [1]:
# initiate new spark session
from pyspark.sql import SparkSession

DATA_PATH = "../course_materials/Spark_for_Machine_Learning/Linear_Regression/"

spark = SparkSession.builder.appName("Linear Regression").getOrCreate()

In [2]:
# load data
df_sample = spark.read.format("libsvm").load(DATA_PATH + "sample_linear_regression_data.txt")

df_sample.show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -7.966593841555266|(10,[0,1,2,3,4,5,...|
| -7.896274316726144|(10,[0,1,2,3,4,5,...|
| -8.464803554195287|(10,[0,1,2,3,4,5,...|
| 2.1214592666251364|(10,[0,1,2,3,4,5,...|
| 1.0720117616524107|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -5.082010756207233|(10,[0,1,2,3,4,5,...|
|  7.887786536531237|(10,[0,1,2,3,4,5,...|
| 14.323146365332388|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-0.8995693247765151|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|  5.601801561245534|(10,[0,1,2,3,4,5,...|
|-3.2256352187273354|(10,[0,1,2,3,4,5,...|
| 1.5299675726687754|(10,[0,1,2,3,4,5,...|
| -0.250102447941961|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

Notice in particular: the `features` column seems to contain an entire row vector in each entry. This is indeed the case and we can verify with `printSchema()`:

In [3]:
df_sample.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



This is because `Estimator` objects expect all the features to be packaged into a single row vector and will look for a single column containing that vector. Note that the `10` at the front of each vector indicates its size; in this case each row holds a 10-dimensional vector.
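
The `(10,[0,1,2,...],...)` display is Spark's notation for a sparse vector: the size, then the indices of the stored entries, then the values at those indices. As a rough illustration of how that triple maps to an ordinary dense vector, here is a plain-Python sketch. The helper `to_dense` and the example values are hypothetical; this is not how MLlib stores vectors internally.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse description into a dense list."""
    dense = [0.0] * size           # every entry not listed is implicitly zero
    for i, v in zip(indices, values):
        dense[i] = v               # place each stored value at its index
    return dense

# hypothetical 10-dimensional vector with only two nonzero entries
dense = to_dense(10, [0, 3], [1.5, -2.0])
print(dense)
```

In the sample data every index from 0 to 9 is listed, so each row stores all 10 values even though it is displayed in sparse notation.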

## 3.2 The `LinearRegression` Estimator And `LinearRegressionModel` Transformer

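The `LinearRegression` estimator is used to fit linear regression models. With the default `solver="auto"` setting, Spark chooses between solving the normal equations directly and using an iterative quasi-Newton optimizer (L-BFGS), rather than plain gradient descent.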

In [4]:
from pyspark.ml.regression import LinearRegression

In [5]:
# instantiate a LinearRegression estimator
lr_estimator = LinearRegression(
    featuresCol = "features",
    labelCol = "label",
    predictionCol = "prediction"
)

# use the estimator to fit a model
lr_model = lr_estimator.fit(df_sample)

The `LinearRegression` estimator is an object with a `fit()` method. The `fit()` method returns a `LinearRegressionModel` object, which has `coefficients` and `intercept` attributes.

In [6]:
# view fitted coefficients and intercepts
print(lr_model.coefficients)

print(lr_model.intercept)

[0.0073350710225801715,0.8313757584337543,-0.8095307954684084,2.441191686884721,0.5191713795290003,1.1534591903547016,-0.2989124112808717,-0.5128514186201779,-0.619712827067017,0.6956151804322931]
0.14228558260358093
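
These two attributes fully determine the fitted model: a prediction is the intercept plus the dot product of the coefficients with the feature vector. The sketch below reproduces that arithmetic in plain Python using the values printed above. The `predict` helper and the two input vectors are made up for illustration.

```python
# fitted values copied from the output above
coefficients = [
    0.0073350710225801715, 0.8313757584337543, -0.8095307954684084,
    2.441191686884721, 0.5191713795290003, 1.1534591903547016,
    -0.2989124112808717, -0.5128514186201779, -0.619712827067017,
    0.6956151804322931,
]
intercept = 0.14228558260358093

def predict(features):
    """Linear prediction: intercept + sum_j coefficients[j] * features[j]."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

# hypothetical inputs: the zero vector, and a vector that is 1 in position 0 only
x_zero = [0.0] * 10
x_unit = [1.0] + [0.0] * 9

print(predict(x_zero))  # equals the intercept
print(predict(x_unit))  # equals the intercept plus the first coefficient
```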


We can look at various goodness-of-fit metrics by looking at a `LinearRegressionSummary` object:

In [7]:
# extract a LinearRegressionSummary object
model_summary = lr_model.summary

# look at model R^2
print(model_summary.r2)

# look at model MSE
print(model_summary.meanSquaredError)

# look at model Root MSE
print(model_summary.rootMeanSquaredError)

# look at model MAE
print(model_summary.meanAbsoluteError)

# look at the standard errors of the coefficient estimates
# note that the last element corresponds to the intercept
print(model_summary.coefficientStandardErrors)

# look at the p-values of the coefficient estimates
print(model_summary.pValues)

# view the predicted outputs of each row in the training data
model_summary.predictions.show()

0.027839179518600154
103.28843028724194
10.16309157133015
8.145215527783876
[0.8068799897284031, 0.817504726374364, 0.8095690515350549, 0.8435177715251213, 0.8066923009911938, 0.8238680228428261, 0.8041910472918519, 0.8095101717564966, 0.8428997032677101, 0.788760166505627, 0.46405794834415603]
[0.9927505031240562, 0.30967074330990396, 0.3178269194409711, 0.003972477331573909, 0.5201486327242175, 0.16213017210149872, 0.7102819001865635, 0.5266812209137877, 0.46256007153356316, 0.37825808848978526, 0.7592692146070568]
+-------------------+--------------------+--------------------+
|              label|            features|          prediction|
+-------------------+--------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|  1.5211201432720063|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...| -0.6658770747591632|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|  0.1568703823211514|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|  0.6374146679690593|
| -7.966593841555266|(10,[0,1
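
For reference, the goodness-of-fit metrics above can be computed by hand from the labels and predictions. The sketch below uses the textbook definitions on a tiny made-up data set; it assumes the summary object follows the same definitions, which is consistent with the output above (the printed RMSE of about 10.163 is the square root of the printed MSE of about 103.288).

```python
import math

def regression_metrics(labels, predictions):
    """Compute MSE, RMSE, MAE and R^2 from paired labels and predictions."""
    n = len(labels)
    residuals = [y - p for y, p in zip(labels, predictions)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }

# hypothetical labels and predictions, chosen so the arithmetic is easy to follow
labels = [1.0, 2.0, 3.0, 4.0]
predictions = [1.5, 1.5, 3.5, 3.5]

metrics = regression_metrics(labels, predictions)
print(metrics)
```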

Recall that fitted models are implementations of the abstract `Transformer` class because they "transform" input features into predictions. We can generate predictions using the `transform()` method:

In [8]:
# generate predictions from some input data
predictions = lr_model.transform(df_sample)

predictions.show()

+-------------------+--------------------+--------------------+
|              label|            features|          prediction|
+-------------------+--------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|  1.5211201432720063|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...| -0.6658770747591632|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|  0.1568703823211514|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|  0.6374146679690593|
| -7.966593841555266|(10,[0,1,2,3,4,5,...|   2.372566473232916|
| -7.896274316726144|(10,[0,1,2,3,4,5,...| -1.9410651727650883|
| -8.464803554195287|(10,[0,1,2,3,4,5,...|  2.2621027950886363|
| 2.1214592666251364|(10,[0,1,2,3,4,5,...|-0.00134792656609...|
| 1.0720117616524107|(10,[0,1,2,3,4,5,...| -3.0051104606414007|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|  3.5437265095387804|
| -5.082010756207233|(10,[0,1,2,3,4,5,...| -0.4889664122481736|
|  7.887786536531237|(10,[0,1,2,3,4,5,...|  1.5073098457843013|
| 14.323146365332388|(10,[0,1,2,3,4,5,..

<br>

---

<br>

## 3.3 Vectors and Vector Assemblers

As noted in the previous section, `Estimator` objects only take in a single column as the feature. This is because MLlib requires us to assemble all of our input values into a single array called a `Vector`. Each row of the training data has a single feature `Vector` as input. To assemble multiple columns of a dataframe into a single `Vector`, we use a feature transformer called a `VectorAssembler`.

In [9]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# load "unprepared" data
df_ecomm = spark.read.csv(DATA_PATH + "Ecommerce_Customers.csv", inferSchema = True, header = True)

In [10]:
print("Num. Rows: " + str(df_ecomm.count()))
df_ecomm.printSchema()
df_ecomm.show()

Num. Rows: 500
root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)

+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|               Email|             Address|          Avatar|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|
+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|mstephenson@ferna...|835 Frank TunnelW...|          Violet| 34.49726772511229| 12.65565114916675| 39.57766801952616|  4.0826206329529615|  587.9

The dataframe contains ecommerce data. Let us build a linear regression model that predicts `Yearly Amount Spent` based on `Avg Session Length`, `Time on App`, `Time on Website`, and `Length of Membership`. Notice that these predictor variables *are not* assembled into a single feature vector yet. To do this, use the `VectorAssembler`, which is a **feature transformer** because it transforms our existing feature columns into a new feature column.

In [11]:
inputCols = [
    "Avg Session Length",
    "Time on App",
    "Time on Website",
    "Length of Membership"
]

# Instantiate a VectorAssembler object
# by specifying the input cols
assembler = VectorAssembler(
    inputCols = inputCols,
    outputCol = "features" # name of the column to assemble to
)

# assemble our feature cols into a single vector of features
# by calling transform(); this takes in our original dataframe
# and returns a new transformed dataframe
df_ecomm_assembled = assembler.transform(df_ecomm)

# view the new dataframe
df_ecomm_assembled.show(5)

+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+--------------------+
|               Email|             Address|          Avatar|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|            features|
+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+--------------------+
|mstephenson@ferna...|835 Frank TunnelW...|          Violet| 34.49726772511229| 12.65565114916675| 39.57766801952616|  4.0826206329529615|  587.9510539684005|[34.4972677251122...|
|   hduke@hotmail.com|4547 Archer Commo...|       DarkGreen| 31.92627202636016|11.109460728682564|37.268958868297744|    2.66403418213262|  392.2049334443264|[31.9262720263601...|
|    pallen@yahoo.com|24645 Valerie Uni...|          Bisque|33.000914755642675|11.330278057777512|37

Notice that the `VectorAssembler` transforms our original dataframe by creating a new column named `features`, which contains a 4-dimensional vector of input values. We are now ready to fit a new linear regression model to this data:

In [12]:
# instantiate a new LinearRegression estimator
lr_estimator = LinearRegression(
    featuresCol = "features",
    labelCol = "Yearly Amount Spent",
    predictionCol = "prediction"
)

# fit a linear regression model to the data
lr_model = lr_estimator.fit(df_ecomm_assembled)

Let's examine the fitted coefficients, $R^2$, standard errors, and p-values of the fitted model:

In [13]:
lr_summary = lr_model.summary

print(f"Model R2: {lr_summary.r2}")

results = zip(
    inputCols,
    lr_model.coefficients,
    lr_summary.coefficientStandardErrors[0:4],
    lr_summary.pValues[0:4]
)

for result in results:
    print(f"{result[0]}: {round(result[1],2)}  ,  SE: {round(result[2],2)}  ,  p-value: {round(result[3], 2)}")

Model R2: 0.9843155370226727
Avg Session Length: 25.73  ,  SE: 0.45  ,  p-value: 0.0
Time on App: 38.71  ,  SE: 0.45  ,  p-value: 0.0
Time on Website: 0.44  ,  SE: 0.44  ,  p-value: 0.33
Length of Membership: 61.58  ,  SE: 0.45  ,  p-value: 0.0


This model explains 98.4% of the variance in `Yearly Amount Spent` (which is really good!). `Avg Session Length`, `Time on App`, and `Length of Membership` all have positive, statistically significant coefficients, i.e. each is positively associated with `Yearly Amount Spent` when the other features are held fixed; this isn't too surprising. What is surprising is that `Time on Website` does not seem to have a strong relationship with `Yearly Amount Spent` in the presence of the other variables (its coefficient is about the same size as its standard error, with a p-value of 0.33). This could possibly be explained by multicollinearity between `Time on Website` and the other features, although that would need to be checked.
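
One way to read the printout is through the ratio of each coefficient to its standard error (the t-statistic). The sketch below recomputes that ratio from the rounded values shown above and converts it to an approximate two-sided p-value. It uses a standard normal approximation rather than whatever exact distribution Spark uses, and the inputs are rounded, so the numbers will only roughly match the printed p-values. The helper name `approx_two_sided_p` is made up for illustration.

```python
import math

# (coefficient, standard error) pairs copied from the rounded printout above
estimates = {
    "Avg Session Length": (25.73, 0.45),
    "Time on App": (38.71, 0.45),
    "Time on Website": (0.44, 0.44),
    "Length of Membership": (61.58, 0.45),
}

def approx_two_sided_p(t_stat):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t_stat) / math.sqrt(2))

for name, (coef, se) in estimates.items():
    t_stat = coef / se  # how many standard errors the estimate is from zero
    print(f"{name}: t = {t_stat:.2f}, approx p = {approx_two_sided_p(t_stat):.3f}")
```

A coefficient that sits only about one standard error from zero, like `Time on Website`, gives an approximate p-value near 0.32, in line with the 0.33 reported above; the other three are dozens of standard errors from zero.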

<br>

---

<br>

## 3.4 Train-Test Split

Generally speaking, linear models come in 2 flavors:

1) Models for **inference** are models meant to study real world relationships between observed variables. Models in classical statistics / econometrics typically fall into this category. Inferential models mainly care about interpretability and the standard errors of the estimated coefficients. For this reason, inferential models are generally fit on *all* of the available data. Standard errors are computed by making strong mathematical assumptions about the underlying data generation process, so special attention must be paid to checking whether the observed data seems to align with the assumptions being made.

2) Models for **prediction** are models meant to generate a predicted value. Models in machine learning typically fall into this category. Predictive models mainly care about **generalization accuracy**, i.e. how accurately the model predicts on *new* data. For this reason, the data set is usually split into a *training* set and a *testing* set. The model is trained on the training set, then asked to make predictions on the testing set. The accuracy on the testing set serves as an estimate of the model's generalization accuracy, and this is usually measured by the **mean squared error** on the testing data set.

The 2 flavors of linear models do not inherently conflict with each other, but they do generally require the model builder to evaluate the models using different metrics. This is important to keep in mind because there is no "one-size-fits-all" approach to model building. The things we do when building a model will always be dictated by what we want to use the model for.

Suppose we wanted to build a predictive model. Then what we care about is the **generalization error**, i.e. how accurate the model is on never-before-seen data. Realistically, it's not possible to know how the model would perform on data we don't have. So the next best alternative is to *estimate* the generalization error by holding out a subset of the data to test the model on. This procedure is called a **train-test split**, and we can do this split on a Spark dataframe by using the `randomSplit()` method.

In [14]:
# train-test split w/ roughly 70% of the data in the training set
# and the remaining ~30% of the data in the testing set;
# note that randomSplit() returns a list of 2 dataframes, so we can
# "unpack" the list by assigning both
# train and test dataframes at once
df_train, df_test = df_ecomm_assembled.randomSplit([0.7, 0.3])

df_train.show()

+--------------------+--------------------+--------------+------------------+------------------+------------------+--------------------+-------------------+--------------------+
|               Email|             Address|        Avatar|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|            features|
+--------------------+--------------------+--------------+------------------+------------------+------------------+--------------------+-------------------+--------------------+
|   aaron04@yahoo.com|16338 Scott Corne...|      SeaGreen| 33.70511279750195|10.163179060052556|37.763041081545246|   4.778973636034999|  521.2407802357949|[33.7051127975019...|
|    aaron11@luna.com|672 Jesus Roads A...|  LightSkyBlue| 32.44952156114242| 13.45772494051235| 37.23880567308968|   2.941410754428091|  503.9783790525795|[32.4495215611424...|
|   aaron22@gmail.com|38678 Sean Drive ...|      DarkGray|33.452295280190306|12.005916370756164| 36.5340956708

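Note that we performed the train-test split on the dataframe where the `features` column was already assembled in vector form. This is a convenient workflow, and it is safe here because `VectorAssembler` does not learn anything from the data, so assembling before the split cannot leak information from the testing set into training.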
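
Two practical points about `randomSplit()` are worth knowing: the split sizes are only approximately 70/30, because rows are assigned at random, and the method accepts an optional `seed` argument for reproducible splits. The plain-Python sketch below illustrates the idea of assigning each row independently with a seeded random draw. It is a conceptual illustration only, not Spark's actual implementation, and the `random_split` helper is hypothetical.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)      # seeded generator makes the split repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

# 500 stand-in rows, matching the row count of the ecommerce data
rows = list(range(500))
train, test = random_split(rows, 0.7, seed=42)

# sizes are close to, but generally not exactly, 350 and 150
print(len(train), len(test))
```

In PySpark the analogous call would be `df_ecomm_assembled.randomSplit([0.7, 0.3], seed=42)`, where the seed value 42 is an arbitrary choice.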

In [15]:
# fit the model on training data
lr_model = lr_estimator.fit(df_train)

# evaluate the model on test data
test_results = lr_model.evaluate(df_test)

We can now view how the model is expected to perform on never-before-seen data by viewing the prediction errors on the testing data set:

In [16]:
print(f"Test MSE: {test_results.meanSquaredError}")
print(f"Test RMSE: {test_results.rootMeanSquaredError}")
print(f"Test MAE: {test_results.meanAbsoluteError}")

df_test.agg( {"Yearly Amount Spent" : "mean"} ).show()

Test MSE: 130.44316439712264
Test RMSE: 11.421171761125153
Test MAE: 8.933644031864349
+------------------------+
|avg(Yearly Amount Spent)|
+------------------------+
|      500.40302478454294|
+------------------------+



The Mean Absolute Error is ~8.9, so the model's predictions are off by about 8.9 on average in the testing data. This seems pretty good considering the average `Yearly Amount Spent` in the test data is ~500 (so the model is off by about 1.8% on average). Because `randomSplit()` was called without a seed, these numbers will vary slightly from run to run.
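
The relative error quoted here is just the test MAE divided by the mean of the label in the test set. A short check using the values printed in the output above:

```python
# values copied from the test-set output above
test_mae = 8.933644031864349
test_mean = 500.40302478454294

# average absolute error as a fraction of the average label
relative_mae = test_mae / test_mean
print(f"Relative MAE: {relative_mae:.1%}")
```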