

# Linear Regression
In this notebook, we look at another commonly used machine learning technique: linear regression. Linear regression is useful when we believe we can predict one variable from knowledge of another variable. For example, if we think that knowing CPU utilization will let us predict the number of sessions or the amount of free memory, then linear regression is a good technique to try.
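
To make the idea concrete before turning to Spark, here is a minimal pure-Python sketch of a one-variable least-squares fit. The helper name `fit_simple_linear_regression` and the data points are made up for illustration (the points lie exactly on the line y = 10 + 50x, so the fit should recover a slope near 50 and an intercept near 10); this is not how Spark computes its fit internally.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: CPU utilization vs. session count, chosen to be exactly linear.
cpu = [0.1, 0.2, 0.3, 0.4]
sessions = [15.0, 20.0, 25.0, 30.0]

slope, intercept = fit_simple_linear_regression(cpu, sessions)
print(round(slope, 6), round(intercept, 6))
```

Real utilization data is noisy, so the fitted line will only approximate the points rather than pass through them.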

In this part, we will use utilization data. 

In [1]:
!pip install pyspark
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

Successfully installed py4j-0.10.9 pyspark-3.0.1


In [2]:
spark = SparkSession.builder.getOrCreate()

In [3]:
spark

In [7]:
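import os

# Build the path to the utilization data relative to the current working directory
MAIN_DIRECTORY = os.getcwd()
file_path = os.path.join(MAIN_DIRECTORY, "Data", "utilization.json")
# Read the JSON file into a Spark DataFrame
df_util = spark.read.format("json").load(file_path)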

In [5]:
df_util.show()

+---------------+-------------------+-----------+---------+-------------+
|cpu_utilization|     event_datetime|free_memory|server_id|session_count|
+---------------+-------------------+-----------+---------+-------------+
|           0.57|03/05/2019 08:06:14|       0.51|      100|           47|
|           0.47|03/05/2019 08:11:14|       0.62|      100|           43|
|           0.56|03/05/2019 08:16:14|       0.57|      100|           62|
|           0.57|03/05/2019 08:21:14|       0.56|      100|           50|
|           0.35|03/05/2019 08:26:14|       0.46|      100|           43|
|           0.41|03/05/2019 08:31:14|       0.58|      100|           48|
|           0.57|03/05/2019 08:36:14|       0.35|      100|           58|
|           0.41|03/05/2019 08:41:14|        0.4|      100|           58|
|           0.53|03/05/2019 08:46:14|       0.35|      100|           62|
|           0.51|03/05/2019 08:51:14|        0.6|      100|           45|
|           0.32|03/05/2019 08:56:14| 

In [8]:
df_util.count()

500000

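In this task, we are going to predict the session count from CPU utilization and free memory. To do that, we first create a VectorAssembler, which combines the input columns into the single vector column that Spark ML estimators expect.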

In [9]:
vecAssembler = VectorAssembler(inputCols=['cpu_utilization','free_memory'],outputCol='features')

In [10]:
vec_df = vecAssembler.transform(df_util)

In [11]:
vec_df.show()

+---------------+-------------------+-----------+---------+-------------+-----------+
|cpu_utilization|     event_datetime|free_memory|server_id|session_count|   features|
+---------------+-------------------+-----------+---------+-------------+-----------+
|           0.57|03/05/2019 08:06:14|       0.51|      100|           47|[0.57,0.51]|
|           0.47|03/05/2019 08:11:14|       0.62|      100|           43|[0.47,0.62]|
|           0.56|03/05/2019 08:16:14|       0.57|      100|           62|[0.56,0.57]|
|           0.57|03/05/2019 08:21:14|       0.56|      100|           50|[0.57,0.56]|
|           0.35|03/05/2019 08:26:14|       0.46|      100|           43|[0.35,0.46]|
|           0.41|03/05/2019 08:31:14|       0.58|      100|           48|[0.41,0.58]|
|           0.57|03/05/2019 08:36:14|       0.35|      100|           58|[0.57,0.35]|
|           0.41|03/05/2019 08:41:14|        0.4|      100|           58| [0.41,0.4]|
|           0.53|03/05/2019 08:46:14|       0.35|     
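
Conceptually, the assembler copies each row and appends a vector built from the chosen input columns. The pure-Python sketch below mimics that behavior on the first three rows of the table above; the `assemble` helper is hypothetical and only illustrates the idea, it is not part of the PySpark API.

```python
# First three rows of the utilization data shown above (relevant columns only).
rows = [
    {"cpu_utilization": 0.57, "free_memory": 0.51, "session_count": 47},
    {"cpu_utilization": 0.47, "free_memory": 0.62, "session_count": 43},
    {"cpu_utilization": 0.56, "free_memory": 0.57, "session_count": 62},
]
input_cols = ["cpu_utilization", "free_memory"]


def assemble(row, input_cols, output_col="features"):
    """Return a copy of the row with an extra column holding the feature vector."""
    assembled = dict(row)  # leave the original row untouched
    assembled[output_col] = [row[col] for col in input_cols]
    return assembled


assembled_rows = [assemble(row, input_cols) for row in rows]
for row in assembled_rows:
    print(row["features"], row["session_count"])
```

The order of `input_cols` matters: it fixes the position of each feature in the vector, and therefore which coefficient of a fitted model belongs to which column.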

Next, we create a LinearRegression estimator, telling it which column holds the features and which holds the label, and then fit it to our data to obtain a model.

In [12]:
lr = LinearRegression(featuresCol='features',labelCol='session_count')

In [13]:
linRegModel = lr.fit(vec_df)

Our fitted linear regression model is specified by two properties: the coefficients (one per feature, in the order given to the VectorAssembler) and the intercept.

In [14]:
linRegModel.coefficients

DenseVector([32.0832, -31.8455])

In [15]:
linRegModel.intercept

61.76149951888711

$$\text{session\_count} = 32.0832 \times \text{cpu\_utilization} - 31.8455 \times \text{free\_memory} + 61.76149951888711$$
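
As a worked check of this formula, the sketch below applies the coefficients and intercept printed above to the first row of the data (`cpu_utilization = 0.57`, `free_memory = 0.51`). The function name `predict_session_count` is made up for illustration, and the coefficients are the rounded values displayed by the notebook, so the result may differ slightly from what the fitted model itself would return.

```python
# Values copied from the notebook output above (coefficients are rounded as displayed).
coefficients = [32.0832, -31.8455]  # order: cpu_utilization, free_memory
intercept = 61.76149951888711


def predict_session_count(cpu_utilization, free_memory):
    """Evaluate the fitted linear equation for one observation."""
    features = [cpu_utilization, free_memory]
    return intercept + sum(w * x for w, x in zip(coefficients, features))


# First row of the data: the observed session_count was 47.
prediction = predict_session_count(0.57, 0.51)
print(round(prediction, 2))  # about 63.81
```

The gap between the prediction and the observed value of 47 is the residual for that row; errors of this kind are what a fit-quality measure summarizes.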

One thing we often want to know when building a predictive model is the error that occurs when we fit that model, because the fitted model will not pass exactly through every data point. A common measure of this error is the Root Mean Squared Error (RMSE).

In [16]:
linRegModel.summary.rootMeanSquaredError

12.042582333120823
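
For reference, RMSE is the square root of the mean of the squared differences between observed and predicted values. The minimal sketch below computes it in pure Python; the `rmse` helper and the numbers are made up for illustration and are not taken from the utilization data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: every prediction is off by exactly 2, so the RMSE is 2.0.
actual = [10, 20, 30, 40]
predicted = [12, 18, 32, 38]
print(rmse(actual, predicted))  # 2.0
```

RMSE is expressed in the same units as the label, so the value above can be read as a typical error of about 12 sessions. Note that `summary` reports the error on the data the model was fitted to.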

#### Well Done!