# Linear Regression in PySpark

This notebook was originally created in Databricks. You can sign up for the free Community Edition of Databricks [here](https://community.cloud.databricks.com/) and then import this notebook.  

This is a very basic introduction to building a linear regression model on Spark using Python.  

Reference documentation for linear regression in PySpark:  

- https://spark.apache.org/docs/latest/mllib-linear-methods.html#regression
- https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegression
- https://docs.databricks.com/spark/latest/mllib/index.html

In [2]:
import numpy as np

# generate a random 2D matrix of correlated, normally distributed data (1000 rows, 2 columns)
# source: https://stackoverflow.com/a/18684433/5356898

xx = np.array([-0.51, 51.2])
yy = np.array([0.33, 51.6])
means = [xx.mean(), yy.mean()]  
stds = [xx.std() / 3, yy.std() / 3]
corr = 0.8 # correlation
covs = [[stds[0]**2          , stds[0]*stds[1]*corr], 
        [stds[0]*stds[1]*corr,           stds[1]**2]] 

data = np.random.multivariate_normal(means, covs, 1000)
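
As a sanity check on what the model should recover later, the parameters above imply a theoretical regression line. The sketch below is an addition, not part of the original notebook. It uses only the standard library and assumes that the first generated column is the label and the second is the feature, which is how the later `toDF(["y","x"])` call names them. The variable names `mean_0`, `std_0`, and so on are illustrative.

```python
import math

# Endpoints and correlation copied from the cell above.
xx = (-0.51, 51.2)
yy = (0.33, 51.6)
corr = 0.8

# For two points, the population standard deviation (NumPy's default for
# .std()) is half the distance between them.
mean_0 = sum(xx) / 2                    # mean of the first column (later "y")
mean_1 = sum(yy) / 2                    # mean of the second column (later "x")
std_0 = abs(xx[1] - xx[0]) / 2 / 3      # same as xx.std() / 3
std_1 = abs(yy[1] - yy[0]) / 2 / 3      # same as yy.std() / 3

# For a bivariate normal, regressing the first column on the second gives:
slope = corr * std_0 / std_1                 # population slope
intercept = mean_0 - slope * mean_1          # population intercept
r_squared = corr ** 2                        # population R-squared
rmse = std_0 * math.sqrt(1 - corr ** 2)      # residual standard deviation

print(f"slope:     {slope:.3f}")
print(f"intercept: {intercept:.3f}")
print(f"R-squared: {r_squared:.3f}")
print(f"RMSE:      {rmse:.3f}")
```

With these inputs the slope comes out at roughly 0.81, the intercept at roughly 4.4, R-squared at 0.64, and RMSE at roughly 5.17. The fitted model on 1000 random rows should land near these values, not exactly on them.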

In [3]:
data.shape

In [4]:
data

In [5]:
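# `sc` is the SparkContext that the Databricks notebook provides
rdd1 = sc.parallelize(data)
# convert NumPy floats to plain Python floats so Spark can infer the schema
rdd2 = rdd1.map(lambda x: [float(i) for i in x])
# name the first column "y" (label) and the second "x" (feature)
df = rdd2.toDF(["y", "x"])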

In [6]:
display(df)

In [7]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline

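# combine the input columns into the single vector column that Spark ML expects
assembler = VectorAssembler(inputCols=["x"], outputCol="features")

# featuresCol defaults to "features", matching the assembler's output column
lr = LinearRegression(labelCol="y")

pipeline = Pipeline(stages=[assembler, lr])

# hold out roughly 25% of the rows for evaluation (pass seed=... for a reproducible split)
train, test = df.randomSplit([0.75, 0.25])

model = pipeline.fit(train)

# adds a "prediction" column to the held-out rows
predictions = model.transform(test)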

evaluator = RegressionEvaluator(labelCol="y", predictionCol="prediction")

# uncomment below for help
#help(evaluator)
#for line in evaluator.explainParams().split('\n'):
#  print(line)

print('RMSE:', evaluator.evaluate(predictions, {evaluator.metricName: "rmse"}))
print('R-squared:', evaluator.evaluate(predictions, {evaluator.metricName: "r2"}))
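
To make the two metrics concrete, here is a minimal plain-Python sketch of the standard definitions of RMSE and R-squared. It is an addition to the notebook, not Spark's implementation. The helper names `rmse` and `r_squared` and the four sample values are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(labels, predictions):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

# Hypothetical labels and predictions, not taken from the notebook's data.
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.5, 5.5, 7.0, 9.0]

print("RMSE:", rmse(labels, predictions))
print("R-squared:", r_squared(labels, predictions))
```

RMSE is in the same units as the label, so it reads as a typical prediction error. R-squared is unitless and measures the share of the label's variance that the predictions explain.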