# Linear Regression with Spark
### Predictive Analytics
#### Licence:
You may use this code for any purpose you wish, provided you keep this notice:
#### AS IS; HOW IS, WHERE IS

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
import pandas as pd
spark=SparkSession.builder.appName('linear Regression').getOrCreate()

In [4]:
# Load the CSV, inferring column types from the data and reading the header row
df = spark.read.csv("./data/Linear_regression_dataset.csv", inferSchema=True, header=True)
print((df.count(), len(df.columns)))  # (number of rows, number of columns)

(1232, 6)


In [5]:
df.printSchema() # prints the datatype of each column 

root
 |-- var_1: integer (nullable = true)
 |-- var_2: integer (nullable = true)
 |-- var_3: integer (nullable = true)
 |-- var_4: double (nullable = true)
 |-- var_5: double (nullable = true)
 |-- output: double (nullable = true)



In [8]:
df.describe().show() # statistical measures of the dataset

+-------+-----------------+-----------------+------------------+--------------------+--------------------+-------------------+
|summary|            var_1|            var_2|             var_3|               var_4|               var_5|             output|
+-------+-----------------+-----------------+------------------+--------------------+--------------------+-------------------+
|  count|             1232|             1232|              1232|                1232|                1232|               1232|
|   mean|715.0819805194806|715.0819805194806| 80.90422077922078|  0.3263311688311693| 0.25927272727272715|0.39734172077922014|
| stddev| 91.5342940441652|93.07993263118064|11.458139049993724|0.015012772334166148|0.012907228928000298|0.03326689862173776|
|    min|              463|              472|                40|               0.277|               0.214|              0.301|
|    max|             1009|             1103|               116|               0.373|               0.294|     

In [12]:
# Check the correlation between an input variable and the output variable using the corr function:
from pyspark.sql.functions import corr
df.select(corr('var_1','output')).show()

+-------------------+
|corr(var_1, output)|
+-------------------+
| 0.9187399607627283|
+-------------------+



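var_1 is strongly correlated with the output column: the coefficient is about 0.92, and values close to 1 indicate a strong positive correlation. Only var_1 is checked here, so the same call would have to be repeated for the other inputs to confirm that it is the strongest predictor.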
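
For reference, the value that `corr` reports is the Pearson correlation coefficient. The sketch below computes it in plain Python to make the formula concrete. The sample numbers are illustrative only (not rows from the dataset), and `pearson` is a hypothetical helper name, not part of Spark.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only (assumed for this sketch)
sample_var_1 = [613.0, 700.0, 712.0, 734.0, 760.0]
sample_output = [0.378, 0.389, 0.417, 0.418, 0.431]
r = pearson(sample_var_1, sample_output)
print(r)
```

A result near 1 means the two columns rise together almost linearly, a result near -1 means one falls as the other rises, and a result near 0 means there is little linear relationship.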

## Feature Engineering
We will now use Spark's VectorAssembler to merge the input columns into a single feature vector column.
You are free to choose which columns to include as inputs.
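
Conceptually, assembling means taking the chosen columns of each row and placing their values, in order, into one vector. The plain-Python sketch below illustrates that idea only; `assemble` is a hypothetical helper, and the `var_4` and `var_5` values are made up for the example. It is not how Spark implements the transformer.

```python
def assemble(row, input_cols):
    """Combine the selected columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]

# Hypothetical row; the var_4 and var_5 values are assumed for illustration
row = {"var_1": 734, "var_2": 688, "var_3": 81,
       "var_4": 0.33, "var_5": 0.26, "output": 0.418}
input_cols = ["var_1", "var_2", "var_3", "var_4", "var_5"]

features = assemble(row, input_cols)
print(features)
```

Passing a shorter list of column names yields a shorter vector, which is all that choosing the input columns amounts to.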

In [16]:
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import VectorAssembler
df.columns # returns the list of column names in the dataset
vec_assemblers = VectorAssembler(inputCols=['var_1','var_2','var_3','var_4','var_5'], outputCol='features')
features_df = vec_assemblers.transform(df)
features_df.printSchema()

root
 |-- var_1: integer (nullable = true)
 |-- var_2: integer (nullable = true)
 |-- var_3: integer (nullable = true)
 |-- var_4: double (nullable = true)
 |-- var_5: double (nullable = true)
 |-- output: double (nullable = true)
 |-- features: vector (nullable = true)



In [17]:
# Select the features column and inspect its contents
features_df.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|[734.0,688.0,81.0...|
|[700.0,600.0,94.0...|
|[712.0,705.0,93.0...|
|[734.0,806.0,69.0...|
|[613.0,759.0,61.0...|
+--------------------+
only showing top 5 rows



Let us now build a linear regression between the features column and the output column.

In [18]:
model_df = features_df.select('features','output') # we create a data frame of two columns
model_df.show(5)

+--------------------+------+
|            features|output|
+--------------------+------+
|[734.0,688.0,81.0...| 0.418|
|[700.0,600.0,94.0...| 0.389|
|[712.0,705.0,93.0...| 0.417|
|[734.0,806.0,69.0...| 0.415|
|[613.0,759.0,61.0...| 0.378|
+--------------------+------+
only showing top 5 rows



## Splitting the Dataset
We split the dataset into training and test sets so that we can train the Linear Regression model and then evaluate its performance.
For this project we chose a 70/30 split and train the model on roughly 70% of the dataset.


In [19]:
train_df,test_df = model_df.randomSplit([0.7,0.3])
print("Number of records in Training set:",train_df.count())
print("Number of records in Testing set:",test_df.count())

Number of records in Training set: 833
Number of records in Testing set: 399
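
The counts are not exactly 70% and 30% of 1232 (which would be about 862 and 370). To my understanding, `randomSplit` decides where each row goes with a random draw, so the sizes are only approximate and change between runs. The plain-Python sketch below mimics that idea under that assumption; `random_split` is a hypothetical helper and does not reproduce Spark's actual sampler.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1232))  # stand-in for the 1232 records
train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))
```

In PySpark, `randomSplit` accepts an optional `seed` argument, for example `model_df.randomSplit([0.7, 0.3], seed=42)`, which is worth setting when the results need to be repeatable.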


## Building and Training Linear Regression Model

In [20]:
from pyspark.ml.regression import LinearRegression
lin_Reg = LinearRegression(labelCol='output')
lr_model=lin_Reg.fit(train_df)
print(lr_model.coefficients)

[0.0003400352443477631,5.773295009547378e-05,0.00021816169925336954,-0.6609011975909741,0.47957054730876447]


In [21]:
print(lr_model.intercept)
print()
training_predictions=lr_model.evaluate(train_df)
print(training_predictions.r2)

0.18640454130112288

0.8664050443005575
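
To see how the fitted numbers are used, a linear regression prediction is the intercept plus the weighted sum of the features. The sketch below applies the coefficients and intercept printed above to one feature row; the feature values are hypothetical (chosen to sit within the ranges shown by `describe`), and `predict` is an illustrative helper rather than a Spark call.

```python
# Coefficients and intercept copied from the printed outputs
coefficients = [0.0003400352443477631, 5.773295009547378e-05,
                0.00021816169925336954, -0.6609011975909741,
                0.47957054730876447]
intercept = 0.18640454130112288

def predict(features):
    """Intercept plus the dot product of coefficients and features."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

# Hypothetical values for var_1 .. var_5 (assumed for illustration)
example_features = [734.0, 688.0, 81.0, 0.33, 0.26]
prediction = predict(example_features)
print(prediction)
```

Each coefficient is the change in the predicted output for a one-unit change in its feature with the others held fixed, so coefficients on differently scaled features cannot be compared directly.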


## Evaluating Linear Regression Model on Test Data
Here we check the performance of the model on unseen or test data.

In [22]:
test_predictions = lr_model.evaluate(test_df)
print(test_predictions.r2)
print()
print(test_predictions.meanSquaredError)

0.8747454430013939

0.00013967795192265252
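
To make the two metrics concrete, the sketch below computes mean squared error and R² with their usual textbook definitions on a handful of values. The labels are the first five output values shown earlier; the predictions are made up for illustration, and both function names are hypothetical helpers. It also takes the square root of the reported test MSE to express the error in the same units as the output column.

```python
import math

def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)

def r2_score(labels, predictions):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

labels = [0.418, 0.389, 0.417, 0.415, 0.378]   # first five output values
predictions = [0.41, 0.40, 0.42, 0.40, 0.38]   # made-up predictions
print(mean_squared_error(labels, predictions))
print(r2_score(labels, predictions))

# Root mean squared error from the test MSE reported above
rmse = math.sqrt(0.00013967795192265252)
print(rmse)
```

The resulting RMSE is about 0.012, which can be read against the output column's mean of roughly 0.397 and standard deviation of roughly 0.033 from the summary table.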
