[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/jkanclerz/data-science-workshop-2024/blob/main/40--ml-regression.ipynb)

In [1]:
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz -O spark-3.5.0-bin-hadoop3.tgz
!tar xf spark-3.5.0-bin-hadoop3.tgz

/usr/local/bin/bash: line 1: apt-get: command not found
--2021-12-11 07:51:58--  https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 300965906 (287M) [application/x-gzip]
Saving to: ‘spark-3.2.0-bin-hadoop3.2.tgz’

spark-3.2.0-bin-hadoop3.2/jars/hadoop-client-runtime-3.3.1.jar: truncated gzip input
tar: Error exit delayed from previous errors.


In [None]:
!pip install -q pyspark findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"

In [1]:
from pyspark.sql import SparkSession

# Build (or reuse) a local Spark session; parentheses avoid fragile backslash continuations.
spark = (
    SparkSession.builder
    .master("local")
    .appName("Test it")
    .config("spark.ui.port", "4050")
    .getOrCreate()
)
sc = spark.sparkContext

23/01/21 02:00:19 WARN Utils: Your hostname, Jakubs-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.8.5 instead (on interface en0)
23/01/21 02:00:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/01/21 02:00:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/21 02:00:21 WARN Utils: Service 'SparkUI' could not bind on port 4050. Attempting port 4051.


In [2]:
df = spark.read.parquet("employees.parquet")

                                                                                

# Features

In [3]:
from pyspark.ml.feature import VectorAssembler

In [4]:
va = VectorAssembler(inputCols = ['wiek'], outputCol = 'features')

In [5]:
reg_df = va.transform(df)

In [6]:
reg_df.select('features', 'wynagrodzenie').show(3)

+--------+-------------+
|features|wynagrodzenie|
+--------+-------------+
|  [34.0]|         7100|
|  [59.0]|        11700|
|  [60.0]|        11500|
+--------+-------------+
only showing top 3 rows





# Model

https://spark.apache.org/docs/latest/ml-classification-regression.html
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegression

In [7]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

## train vs test

In [8]:
(train_df, test_df) = reg_df.randomSplit([0.7, 0.3])
 
lr = LinearRegression(featuresCol='features',
                      labelCol='wynagrodzenie')
 
lr_model = lr.fit(train_df)

23/01/21 02:00:45 WARN Instrumentation: [fdddf5b0] regParam is zero, which might cause numerical instability and overfitting.
23/01/21 02:00:46 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
23/01/21 02:00:46 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
23/01/21 02:00:46 WARN InstanceBuilder$NativeLAPACK: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK

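The split above is random, so the fitted coefficients and scores change slightly from run to run. `randomSplit` accepts an optional `seed` argument that makes the split repeatable. The sketch below is a plain-Python illustration of the idea only, not Spark's implementation: the helper `random_split` and the toy `rows` list are hypothetical. Each row is assigned independently, so the resulting sizes are only approximately 70/30.

```python
import random


def random_split(rows, weights, seed):
    """Assign every row to one split independently, weighted by `weights`.

    Hypothetical helper for illustration; Spark's own sampling differs in detail.
    """
    rng = random.Random(seed)  # a fixed seed makes the assignment repeatable
    splits = [[] for _ in weights]
    split_indices = range(len(weights))
    for row in rows:
        index = rng.choices(split_indices, weights=weights)[0]
        splits[index].append(row)
    return splits


rows = list(range(1000))  # stand-in for DataFrame rows
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))  # roughly 700 and 300, not exactly
```

With the same seed the two lists come out identical on every run, which is what makes results comparable between runs.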

In [9]:
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Coefficients: [200.00331640934172]
Intercept: -13.854504539602958
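For a single feature, the least-squares slope and intercept have a closed form, which may help in reading the two numbers above. The following is a minimal plain-Python sketch, assuming ordinary least squares without regularization; `fit_simple_ols` is a hypothetical helper, and Spark's solver may reach its result by a different numerical route. The three sample points are the rows shown earlier by `show(3)`, so the result will not match the full-data fit.

```python
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized; the factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # undefined when all x values are equal
    intercept = mean_y - slope * mean_x
    return slope, intercept


ages = [34.0, 59.0, 60.0]            # wiek
salaries = [7100.0, 11700.0, 11500.0]  # wynagrodzenie
slope, intercept = fit_simple_ols(ages, salaries)
print("slope:", slope, "intercept:", intercept)
```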


$$\hat{Y} = X_{1}\beta_{1} + \beta_{0}$$

With the coefficient and intercept printed above (`wiek` is age, `wynagrodzenie` is salary):

$$\widehat{\text{wynagrodzenie}} = 200.0033 \cdot \text{wiek} - 13.8545$$
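As a quick check, the fitted line can be evaluated by hand. The sketch below plugs in the coefficient and intercept printed by the cell above; `predict_salary` is a hypothetical helper, not part of the PySpark API. The values it produces for ages 25 and 26 should agree with the `prediction` column shown for the test rows later in the notebook.

```python
# Values copied from the printed model output above.
COEFFICIENT = 200.00331640934172
INTERCEPT = -13.854504539602958


def predict_salary(age):
    """Evaluate the fitted line: wynagrodzenie = COEFFICIENT * wiek + INTERCEPT."""
    return COEFFICIENT * age + INTERCEPT


for age in (25.0, 26.0):
    print(age, predict_salary(age))
```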

In [10]:
trainingSummary = lr_model.summary

In [11]:
type(trainingSummary)

pyspark.ml.regression.LinearRegressionTrainingSummary

https://spark.apache.org/docs/2.3.2/api/java/org/apache/spark/ml/regression/LinearRegressionTrainingSummary.html

In [12]:
print("R2: %f" % trainingSummary.r2)

R2: 0.982710


In [13]:
lr_predictions = lr_model.transform(test_df)

In [14]:
lr_predictions.select("prediction","wynagrodzenie","features").show(10)

+-----------------+-------------+--------+
|       prediction|wynagrodzenie|features|
+-----------------+-------------+--------+
| 4986.22840569394|         4600|  [25.0]|
| 4986.22840569394|         4800|  [25.0]|
| 4986.22840569394|         4900|  [25.0]|
| 4986.22840569394|         5000|  [25.0]|
| 4986.22840569394|         5300|  [25.0]|
| 4986.22840569394|         5500|  [25.0]|
|5186.231722103282|         4700|  [26.0]|
|5186.231722103282|         4700|  [26.0]|
|5186.231722103282|         4800|  [26.0]|
|5186.231722103282|         4800|  [26.0]|
+-----------------+-------------+--------+
only showing top 10 rows
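To get a feel for the size of the errors, the residuals of the ten rows shown above can be summarized by hand. This is a minimal plain-Python sketch with hypothetical helpers `rmse` and `mae`; it covers only the displayed rows, so the numbers say nothing definitive about the whole test set.

```python
import math

# (prediction, wynagrodzenie) pairs copied from the table above.
shown_rows = [
    (4986.22840569394, 4600), (4986.22840569394, 4800),
    (4986.22840569394, 4900), (4986.22840569394, 5000),
    (4986.22840569394, 5300), (4986.22840569394, 5500),
    (5186.231722103282, 4700), (5186.231722103282, 4700),
    (5186.231722103282, 4800), (5186.231722103282, 4800),
]


def rmse(predictions, labels):
    """Root mean squared error: penalizes large misses more heavily."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predictions, labels):
    """Mean absolute error: the average miss, in salary units."""
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)


predictions = [p for p, _ in shown_rows]
labels = [y for _, y in shown_rows]
print("RMSE:", rmse(predictions, labels))
print("MAE:", mae(predictions, labels))
```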



## Evaluation on test data

In [15]:
from pyspark.ml.evaluation import RegressionEvaluator

In [16]:
# R2:
lr_evaluator = RegressionEvaluator(predictionCol="prediction",
                                   labelCol="wynagrodzenie",
                                   metricName="r2")
 
print("R2 on test data = %g" % lr_evaluator.evaluate(lr_predictions))

R2 on test data = 0.983389
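The R2 score compares the model's squared error with that of a baseline that always predicts the mean label. The sketch below is a plain-Python illustration, assuming the usual definition $1 - SS_{res} / SS_{tot}$; `r_squared` is a hypothetical helper, not the evaluator's actual code.

```python
def r_squared(predictions, labels):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    # Squared error of the model's predictions.
    ss_res = sum((y - p) ** 2 for p, y in zip(predictions, labels))
    # Squared error of the "always predict the mean" baseline.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot  # undefined when all labels are equal


labels = [4600.0, 4800.0, 5000.0, 5300.0]
print(r_squared(labels, labels))          # perfect predictions
print(r_squared([4925.0] * 4, labels))    # predicting the mean label everywhere
```

A score close to 1, as reported above, means the model's squared error is small relative to the spread of the labels; a score of 0 means it does no better than the mean baseline.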


In [17]:
sc.stop()