<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/spark/pyspark_basic_linear_regression_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark Linear Regression Model

## Setup PySpark instance

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 2.4.5 with Hadoop 2.7, Java 8, and findspark, which locates the Spark installation on the system.

In [19]:
#@title ### Setup PySpark instance
#@markdown To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 2.4.5 with Hadoop 2.7, Java 8, and findspark, which locates the Spark installation on the system.

#@markdown **Upon successful completion of this cell, a ``SparkSession`` named ``spark`` will be available to interact with the service.**

#@markdown Creating multiple ``SparkSession`` or ``SparkContext`` objects can
#@markdown cause issues. If you need a reference to the session, it is
#@markdown recommended to use ``SparkSession.builder.getOrCreate()``.


!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

import os
import findspark
# environment variables
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
os.environ['SPARK_HOME'] = 'spark-2.4.5-bin-hadoop2.7'
# locate the Spark installation and make pyspark importable
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark

## Linear Regression Model


Download the Boston housing dataset.

In [0]:
!wget -q https://raw.githubusercontent.com/martin-fabbri/colab-notebooks/master/data/boston.csv

In [0]:
dataset = spark.read.csv('boston.csv', inferSchema=True, header=True)

``SparkSession`` has an attribute called ``catalog`` which lists all the data inside the cluster. The list below is empty because ``dataset`` has not been registered as a table or view.

In [23]:
spark.catalog.listTables()

[]

In [24]:
dataset.printSchema()

root
 |-- CRIM: double (nullable = true)
 |-- ZN: double (nullable = true)
 |-- INDUS: double (nullable = true)
 |-- CHAS: integer (nullable = true)
 |-- NX: double (nullable = true)
 |-- RM: double (nullable = true)
 |-- AGE: double (nullable = true)
 |-- DIS: double (nullable = true)
 |-- RAD: integer (nullable = true)
 |-- TAX: double (nullable = true)
 |-- PTRATIO: double (nullable = true)
 |-- B: double (nullable = true)
 |-- LSTAT: double (nullable = true)
 |-- MEDV: double (nullable = true)



The next step is to combine all the feature columns into a single vector column, which we name ``Attributes`` via ``outputCol``.

In [31]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NX',
                                        'RM', 'AGE', 'DIS', 'RAD', 'TAX',
                                        'PTRATIO', 'B', 'LSTAT'],
                            outputCol='Attributes')
output = assembler.transform(dataset)

# keep only the feature vector (input) and the target column (output)
finalized_data = output.select('Attributes', 'MEDV')
finalized_data.show()

+--------------------+----+
|          Attributes|MEDV|
+--------------------+----+
|[0.00632,18.0,2.3...|24.0|
|[0.02731,0.0,7.07...|21.6|
|[0.02729,0.0,7.07...|34.7|
|[0.03236999999999...|33.4|
|[0.06905,0.0,2.18...|36.2|
|[0.02985,0.0,2.18...|28.7|
|[0.08829,12.5,7.8...|22.9|
|[0.14455,12.5,7.8...|27.1|
|[0.21124,12.5,7.8...|16.5|
|[0.17004,12.5,7.8...|18.9|
|[0.22489,12.5,7.8...|15.0|
|[0.11747,12.5,7.8...|18.9|
|[0.09378,12.5,7.8...|21.7|
|[0.62976,0.0,8.14...|20.4|
|[0.63796000000000...|18.2|
|[0.62739,0.0,8.14...|19.9|
|[1.05393,0.0,8.14...|23.1|
|[0.7842,0.0,8.14,...|17.5|
|[0.80271,0.0,8.14...|20.2|
|[0.7258,0.0,8.14,...|18.2|
+--------------------+----+
only showing top 20 rows



Our DataFrame now has two columns, ``Attributes`` and ``MEDV``: the input features and the target column, respectively. Next, we split the data into training and test sets before fitting the model.
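
As a rough illustration of what a weighted random split does, the sketch below assigns each row to the training or test set with an independent random draw. It is plain Python, not Spark's implementation; the helper name `random_split`, the seed, and the toy rows are assumptions made for illustration. Because each row is drawn independently, the sizes in this sketch only approximate the requested 80/20 ratio.

```python
import random

def random_split(rows, train_fraction=0.8, seed=42):
    """Hypothetical helper: assign each row to train or test with one random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # A draw below train_fraction sends the row to the training set.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(500))  # toy stand-in for the rows of a dataset
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the sketch reproducible. Spark's ``randomSplit`` also accepts an optional ``seed`` argument for the same purpose.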

In [36]:
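from pyspark.ml.regression import LinearRegression

# 80/20 split; no seed is set, so the split differs between runs
train_data, test_data = finalized_data.randomSplit([0.8, 0.2])

regressor = LinearRegression(featuresCol='Attributes', labelCol='MEDV')
# fit() returns the fitted model
regressor = regressor.fit(train_data)
# evaluate() returns a summary; its predictions attribute holds the scored test rows
pred = regressor.evaluate(test_data)
pred.predictions.show()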

+--------------------+----+------------------+
|          Attributes|MEDV|        prediction|
+--------------------+----+------------------+
|[0.0187,85.0,4.15...|23.1| 25.62529848996516|
|[0.02055,85.0,0.7...|24.7|25.383395723957356|
|[0.02543,55.0,3.7...|23.9| 27.95900238002728|
|[0.02731,0.0,7.07...|21.6|25.421306443814338|
|[0.0315,95.0,1.47...|34.9|30.239757246103377|
|[0.03501999999999...|28.5|33.180222117817465|
|[0.03768,80.0,1.5...|34.6|35.137993589186514|
|[0.04297,52.5,5.3...|24.8|26.752497041852898|
|[0.04301,80.0,1.9...|18.2|14.620407686959577|
|[0.0456,0.0,13.89...|23.3|25.483724867444216|
|[0.04819,80.0,3.6...|21.9|24.516677393482873|
|[0.05187999999999...|22.5| 21.51696034676715|
|[0.05497000000000...|19.0|20.681107473459594|
|[0.05515,33.0,2.1...|36.1| 32.63034110467832|
|[0.05602000000000...|50.0| 35.66183092150246|
|[0.05646,0.0,12.8...|21.2|21.393845387141205|
|[0.06076,0.0,11.9...|23.9|28.083840518530724|
|[0.06211000000000...|22.9|21.105081530614935|
|[0.06417,0.0

In [37]:
# coefficients of the regression model (one per input feature)
coeff = regressor.coefficients

# intercept (bias term) of the model
intr = regressor.intercept

print("The coefficient of the model is : %a" % coeff)
print("The Intercept of the model is : %f" % intr)

The coefficient of the model is : DenseVector([-0.0966, 0.0466, 0.0558, 2.053, -18.3335, 3.9833, 0.0184, -1.2791, 0.2753, -0.0109, -0.9646, 0.0085, -0.5678])
The Intercept of the model is : 34.019999
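
A fitted linear regression predicts MEDV as the dot product of the coefficient vector and the feature vector, plus the intercept. The sketch below reproduces that arithmetic in plain Python using the rounded coefficients and intercept printed above. The feature values are made-up numbers in the column order used by the assembler, chosen only for illustration, so the result does not correspond to a row of the dataset.

```python
# Rounded coefficients and intercept copied from the output above.
coefficients = [-0.0966, 0.0466, 0.0558, 2.053, -18.3335, 3.9833, 0.0184,
                -1.2791, 0.2753, -0.0109, -0.9646, 0.0085, -0.5678]
intercept = 34.019999

# Hypothetical feature vector (illustrative values, not taken from boston.csv),
# in the order CRIM, ZN, INDUS, CHAS, NX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT.
features = [0.1, 10.0, 5.0, 0, 0.5, 6.0, 60.0, 4.0, 4, 300.0, 18.0, 390.0, 10.0]

def predict(features, coefficients, intercept):
    """Linear model: sum of coefficient * feature, plus the intercept."""
    return sum(c * x for c, x in zip(coefficients, features)) + intercept

prediction = predict(features, coefficients, intercept)
print("Predicted MEDV: %.3f" % prediction)
```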


In [39]:
from pyspark.ml.evaluation import RegressionEvaluator
# named evaluator to avoid shadowing Python's built-in eval()
evaluator = RegressionEvaluator(labelCol="MEDV", predictionCol="prediction", metricName="rmse")

# Root Mean Square Error
rmse = evaluator.evaluate(pred.predictions)
print("RMSE: %.3f" % rmse)

# Mean Square Error
mse = evaluator.evaluate(pred.predictions, {evaluator.metricName: "mse"})
print("MSE: %.3f" % mse)

# Mean Absolute Error
mae = evaluator.evaluate(pred.predictions, {evaluator.metricName: "mae"})
print("MAE: %.3f" % mae)

# r2 - coefficient of determination
r2 = evaluator.evaluate(pred.predictions, {evaluator.metricName: "r2"})
print("r2: %.3f" % r2)

RMSE: 4.779
MSE: 22.841
MAE: 3.330
r2: 0.712
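
To make the four metrics concrete, the sketch below computes them by hand in plain Python from the first five label/prediction pairs shown in the predictions table earlier. The helper name `regression_metrics` is an assumption for illustration. Because it uses only five rows, its results will differ from the full test-set values reported above.

```python
import math

# First five (MEDV, prediction) pairs copied from the predictions table.
pairs = [
    (23.1, 25.62529848996516),
    (24.7, 25.383395723957356),
    (23.9, 27.95900238002728),
    (21.6, 25.421306443814338),
    (34.9, 30.239757246103377),
]

def regression_metrics(pairs):
    """Hypothetical helper: RMSE, MSE, MAE and R^2 for (label, prediction) pairs."""
    n = len(pairs)
    errors = [label - prediction for label, prediction in pairs]
    mse = sum(e * e for e in errors) / n            # mean of squared errors
    mae = sum(abs(e) for e in errors) / n           # mean of absolute errors
    mean_label = sum(label for label, _ in pairs) / n
    ss_tot = sum((label - mean_label) ** 2 for label, _ in pairs)
    r2 = 1.0 - (mse * n) / ss_tot                   # 1 - SS_res / SS_tot
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae, "r2": r2}

for name, value in regression_metrics(pairs).items():
    print("%s: %.3f" % (name, value))
```

RMSE is the square root of MSE, which is consistent with the reported values: 4.779 squared is approximately 22.84.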
