## Linear Regression with PySpark

Machine learning with PySpark uses a slightly different train/test setup than pandas/sklearn. The features used to train the model cannot be passed as separate columns; they have to be merged into a single vector column first.

In [40]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression


In [41]:
df = spark.read.csv("Ecommerce_Customers.csv", header=True, inferSchema=True)

In [42]:
df.show(5)

+--------------------+--------------------+------------------+-----------+---------------+--------------------+-------------------+
|               Email|             Address|Avg Session Length|Time on App|Time on Website|Length of Membership|Yearly Amount Spent|
+--------------------+--------------------+------------------+-----------+---------------+--------------------+-------------------+
|mstephenson@ferna...|835 Frank TunnelW...|       34.49726773|12.65565115|    39.57766802|         4.082620633|         587.951054|
|   hduke@hotmail.com|4547 Archer Commo...|       31.92627203|11.10946073|    37.26895887|         2.664034182|        392.2049334|
|    pallen@yahoo.com|24645 Valerie Uni...|       33.00091476|11.33027806|    37.11059744|         4.104543202|        487.5475049|
|riverarebecca@gma...|1414 David Throug...|       34.30555663|13.71751367|    36.72128268|         3.120178783|         581.852344|
|mstephens@davidso...|14023 Rodriguez P...|       33.33067252|12.79518855|  

### Merging all of the feature columns into a single vector column

In [44]:
output.show(5)

+--------------------+--------------------+------------------+-----------+---------------+--------------------+-------------------+--------------------+
|               Email|             Address|Avg Session Length|Time on App|Time on Website|Length of Membership|Yearly Amount Spent|Independent Features|
+--------------------+--------------------+------------------+-----------+---------------+--------------------+-------------------+--------------------+
|mstephenson@ferna...|835 Frank TunnelW...|       34.49726773|12.65565115|    39.57766802|         4.082620633|         587.951054|[34.49726773,12.6...|
|   hduke@hotmail.com|4547 Archer Commo...|       31.92627203|11.10946073|    37.26895887|         2.664034182|        392.2049334|[31.92627203,11.1...|
|    pallen@yahoo.com|24645 Valerie Uni...|       33.00091476|11.33027806|    37.11059744|         4.104543202|        487.5475049|[33.00091476,11.3...|
|riverarebecca@gma...|1414 David Throug...|       34.30555663|13.71751367|    36.7

In [45]:
output.select("Independent Features").show(5)

+--------------------+
|Independent Features|
+--------------------+
|[34.49726773,12.6...|
|[31.92627203,11.1...|
|[33.00091476,11.3...|
|[34.30555663,13.7...|
|[33.33067252,12.7...|
+--------------------+
only showing top 5 rows



As we can see, the individual feature columns are merged into a single vector column, with one 1D vector per row.

In [46]:
finalized_data = output.select("Independent Features", "Yearly Amount Spent")
finalized_data.show()

+--------------------+-------------------+
|Independent Features|Yearly Amount Spent|
+--------------------+-------------------+
|[34.49726773,12.6...|         587.951054|
|[31.92627203,11.1...|        392.2049334|
|[33.00091476,11.3...|        487.5475049|
|[34.30555663,13.7...|         581.852344|
|[33.33067252,12.7...|         599.406092|
|[33.87103788,12.0...|        637.1024479|
|[32.0215955,11.36...|        521.5721748|
|[32.73914294,12.3...|        549.9041461|
|[33.9877729,13.38...|         570.200409|
|[31.93654862,11.8...|        427.1993849|
|[33.99257277,13.3...|        492.6060127|
|[33.87936082,11.5...|        522.3374046|
|[29.53242897,10.9...|        408.6403511|
|[33.19033404,12.9...|        573.4158673|
|[32.38797585,13.1...|        470.4527333|
|[30.73772037,12.6...|        461.7807422|
|[32.1253869,11.73...|        457.8476959|
|[32.33889932,12.0...|        407.7045475|
|[32.18781205,14.7...|        452.3156755|
|[32.61785606,13.9...|        605.0610388|
+----------

### Train / Test Split

In [47]:
train_data, test_data = finalized_data.randomSplit([0.8, 0.2])

### Creating a Linear Regression Model

In [48]:
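# Configure the estimator: which column holds the features and which the label
regressor = LinearRegression(
    featuresCol="Independent Features", labelCol="Yearly Amount Spent"
)
# fit() returns a fitted LinearRegressionModel; the name `regressor` is reused for it
regressor = regressor.fit(train_data)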

### Retrieving the regression model coefficients and intercept

In [49]:
regressor.coefficients

DenseVector([25.5669, 38.8141, 0.7065, 61.5741])

In [50]:
regressor.intercept

-1057.194599850707
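
To make the meaning of these numbers concrete, a prediction is the intercept plus the dot product of the coefficients with a row's feature vector. The sketch below applies that formula by hand to the first row of the dataset. It assumes the coefficient order matches the column order in the table (Avg Session Length, Time on App, Time on Website, Length of Membership) and uses the rounded coefficients as printed, so the result is approximate.

```python
# Values copied from the outputs above (coefficients are rounded as displayed)
coefficients = [25.5669, 38.8141, 0.7065, 61.5741]
intercept = -1057.194599850707

# First row of the dataset: session length, time on app, time on website, membership length
first_row = [34.49726773, 12.65565115, 39.57766802, 4.082620633]

# Linear model: y_hat = intercept + sum(coefficient_i * feature_i)
prediction = intercept + sum(c * x for c, x in zip(coefficients, first_row))
print(round(prediction, 2))
```

For this row the result is about 595, compared with a recorded Yearly Amount Spent of 587.95. The small coefficient on the third feature suggests that Time on Website contributes little to the prediction in this particular fit.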

### View predicted results

In [51]:
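# evaluate() scores the fitted model on the held-out rows
pred_results = regressor.evaluate(test_data)
# .predictions is the test DataFrame with an added `prediction` column
pred_results.predictions.show()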

+--------------------+-------------------+------------------+
|Independent Features|Yearly Amount Spent|        prediction|
+--------------------+-------------------+------------------+
|[30.4925367,11.56...|        282.4712457| 287.8592403602297|
|[31.06132516,12.3...|        487.5554581| 493.9486499925158|
|[31.12397435,12.3...|        486.9470538|508.55688997600237|
|[31.1695068,13.97...|        427.3565308| 417.7863405911853|
|[31.26810421,12.1...|        423.4705332| 427.3595065454251|
|[31.44597248,12.8...|        484.8769649|482.75670323646114|
|[31.5171218,10.74...|        275.9184207| 281.0526030466485|
|[31.53160448,13.3...|        436.5156057|433.85892587759804|
|[31.57613197,12.5...|         541.226584| 543.6486115846994|
|[31.7207699,11.75...|        538.7749335| 546.5630685071342|
|[31.80930032,11.6...|        536.7718994| 548.0735225358278|
|[31.81248256,10.8...|         392.810345|396.01909073501633|
|[31.82934646,11.2...|         385.152338| 384.3995943418879|
|[31.904

### Model Evaluation

In [52]:
# regressor.summary reports metrics computed on the training data
trainingSummary = regressor.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

RMSE: 9.767497
r2: 0.984700
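
The RMSE and r2 printed above come from `regressor.summary`, which in Spark ML is the training summary, so they describe the fit on `train_data` rather than on unseen rows. As an illustration of how the same two metrics are defined, the sketch below computes them by hand from the first five (label, prediction) pairs of the test predictions table. Five rows are far too few for a reliable estimate; this only shows the arithmetic.

```python
import math

# (Yearly Amount Spent, prediction) pairs copied from the first five test rows
pairs = [
    (282.4712457, 287.8592403602297),
    (487.5554581, 493.9486499925158),
    (486.9470538, 508.55688997600237),
    (427.3565308, 417.7863405911853),
    (423.4705332, 427.3595065454251),
]

labels = [y for y, _ in pairs]
mean_label = sum(labels) / len(labels)

# Residual and total sums of squares
residual_ss = sum((y - p) ** 2 for y, p in pairs)
total_ss = sum((y - mean_label) ** 2 for y in labels)

rmse = math.sqrt(residual_ss / len(pairs))  # root mean squared error
r2 = 1 - residual_ss / total_ss             # coefficient of determination

print(f"RMSE: {rmse:.4f}")
print(f"r2: {r2:.4f}")
```

On the full test set, the `pred_results` object returned by `regressor.evaluate(test_data)` should expose the same quantities as `pred_results.rootMeanSquaredError` and `pred_results.r2`.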
