<img src="images/cads-logo.png" style="height: 100px;" align=left> <img src="images/apache_spark.png" style="height: 20%;width:20%" align=right>

MC: https://colab.research.google.com/drive/18pXntjU2friBLOh4yJ8MdkHiUZ9gi8bd?usp=sharing

# Linear Regression
In this notebook, we look at another commonly used machine learning technique: linear regression. Linear regression is useful when we believe that one variable can be predicted from knowledge of one or more other variables. For example, if we think that knowing CPU utilization lets us predict the number of sessions or the amount of free memory, then linear regression is a good technique to use.

In this part, we will use utilization data. 

In [4]:
!pip install pyspark
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/f0/26/198fc8c0b98580f617cb03cb298c6056587b8f0447e20fa40c5b634ced77/pyspark-3.0.1.tar.gz (204.2MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612242 sha256=792b208a09b705f02c699c0fd38fe20b41f14c4bfe6ba125e6c1eded2548aea5
  Stored in directory: /root/.cache/pip/wheels/5e/bd/07/031766ca628adec8435bb40f0bd83bb676ce65ff4007f8e73f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1


In [5]:
spark = SparkSession.builder.getOrCreate()

In [None]:
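import os
# Local alternative to the Colab cells below: load the data relative to the working directory.
MAIN_DIRECTORY = os.getcwd()
file_path = os.path.join(MAIN_DIRECTORY, "data", "utilization.json")
df_util = spark.read.format("json").load(file_path)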

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
cd "/content/drive/MyDrive/UM Lecture/CADS/13 BDA with Apache Spark 2day"

/content/drive/MyDrive/UM Lecture/CADS/13 BDA with Apache Spark 2day


In [9]:
df_util = spark.read.format("json").load("data/utilization.json")

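In this task, we are going to predict the session count from CPU utilization and free memory. Spark ML expects the input columns to be combined into a single vector column, so we will need to create a VectorAssembler. First, let's check the size of the data and look at the first few rows.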

In [10]:
df_util.count()

500000

In [11]:
df_util.show(5)

+---------------+-------------------+-----------+---------+-------------+
|cpu_utilization|     event_datetime|free_memory|server_id|session_count|
+---------------+-------------------+-----------+---------+-------------+
|           0.77|03/16/2019 17:21:40|       0.22|      115|           58|
|           0.53|03/16/2019 17:26:40|       0.23|      115|           64|
|            0.6|03/16/2019 17:31:40|       0.19|      115|           82|
|           0.46|03/16/2019 17:36:40|       0.32|      115|           60|
|           0.77|03/16/2019 17:41:40|       0.49|      115|           84|
+---------------+-------------------+-----------+---------+-------------+
only showing top 5 rows



Next, we create a linear regression object that we can later fit to our data.

Once fitted, the linear regression model is specified by two properties: the coefficients and the intercept.

In [15]:
vecAssembler = VectorAssembler(inputCols = ['cpu_utilization','free_memory'],outputCol = 'features')

In [16]:
vec_df = vecAssembler.transform(df_util)
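
To make the assembling step concrete, here is a minimal plain-Python sketch of what the transform does conceptually: for each row it adds a `features` entry holding the values of the input columns, in order. The `assemble` helper is hypothetical and exists only for illustration; the real `VectorAssembler` produces a Spark vector column on a distributed DataFrame. The two rows are copied from the `show(5)` output above.

```python
# Plain-Python analogue of VectorAssembler.transform (illustration only).
def assemble(rows, input_cols, output_col="features"):
    """Return copies of rows with the input columns packed into one list."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # keep the original row untouched
        new_row[output_col] = [float(row[col]) for col in input_cols]
        assembled.append(new_row)
    return assembled

# First two rows of the utilization data shown earlier.
rows = [
    {"cpu_utilization": 0.77, "free_memory": 0.22, "session_count": 58},
    {"cpu_utilization": 0.53, "free_memory": 0.23, "session_count": 64},
]

vec_rows = assemble(rows, ["cpu_utilization", "free_memory"])
for row in vec_rows:
    print(row["features"], row["session_count"])
```

The label column (`session_count`) is left as it is; only the predictor columns are packed into the feature vector.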

One of the things we often want to know when building a predictive model is the error that occurs when we fit that model, because the fitted line will not pass exactly through all of the data points. A commonly used measure is the Root Mean Squared Error (RMSE).

In [17]:
lr = LinearRegression(featuresCol='features',labelCol = 'session_count')

In [18]:
LinRegModel = lr.fit(vec_df)

In [20]:
LinRegModel.coefficients

DenseVector([32.0832, -31.8455])

In [22]:
LinRegModel.intercept

61.76149951889013
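
The coefficients and the intercept are all that is needed to make a prediction: the model computes the intercept plus the dot product of the coefficients with the feature vector. A minimal sketch in plain Python, using rounded versions of the values printed above and the first row of the data; the `predict` helper is hypothetical and not part of the Spark API.

```python
# Values copied (rounded) from the fitted model output above.
coefficients = [32.0832, -31.8455]  # order: cpu_utilization, free_memory
intercept = 61.7615

def predict(features):
    """Linear model: intercept + sum(coefficient_i * feature_i)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# First row of the data: cpu_utilization=0.77, free_memory=0.22, session_count=58.
predicted = predict([0.77, 0.22])
residual = 58 - predicted  # observed minus predicted
print(round(predicted, 2), round(residual, 2))
```

For this row the model predicts about 79.46 sessions against an observed 58. The signs of the coefficients indicate that, in this fit, higher CPU utilization goes with more sessions and more free memory goes with fewer.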

In [23]:
LinRegModel.summary.rootMeanSquaredError

12.042582333120887
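
RMSE is the square root of the mean of the squared differences between the observed and predicted labels, so it is expressed in the same units as `session_count`. A minimal plain-Python sketch of the computation; the `rmse` helper and the small sample lists are made up for illustration and are not taken from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical observed and predicted session counts.
observed = [58, 64, 82, 60]
estimated = [61, 60, 82, 65]
error = rmse(observed, estimated)
print(error)
```

The value reported by `LinRegModel.summary.rootMeanSquaredError` is this quantity computed over the training data, so the fitted model's predictions are typically off by roughly 12 sessions.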

#### Well Done!