![](imgs/kodolamaczlogo.png)

# Big Data Processing with Apache Spark

Notebook author: Jakub Nowacki.

## Machine Learning on Spark

Spark ML provides many machine learning algorithms that run well in a distributed fashion and scale, including:

* linear regression,
* logistic regression,
* random forest,
* k-means clustering.

This API works directly with DataFrames, which makes it somewhat easier to use.

In [1]:
import findspark
findspark.init()


In [2]:
import pyspark
import pyspark.sql.functions as func

spark = pyspark.sql.SparkSession.builder \
    .appName("SparkML") \
    .getOrCreate()
sc = spark.sparkContext

In [3]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

In [6]:
# Load the data from a CSV file.
# A detailed description of the data is available in the README file.
rdd = sc.textFile("data/Bike-Sharing-Dataset/day.csv")

In [7]:
rdd.first()

'instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt'

In [8]:
def from_csv(line):
    cols = line.split(",")
    d = dict()
    d['yr'] = float(cols[3])
    d['workingday'] = float(cols[7])
    d['weathersit'] = float(cols[8])
    d['temp'] = float(cols[9])
    d['hum'] = float(cols[11])
    d['windspeed'] = float(cols[12])
    d['casual'] = float(cols[13])   
    return pyspark.Row(**d)
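
The positional indices used in `from_csv` (3, 7, 8, ...) follow the header line printed earlier. As an illustrative alternative, the sketch below looks the positions up by column name using only the Python standard library. The helper name `parse_line`, the `WANTED` list, and the sample line are assumptions made for this example, and the values for columns not shown in the notebook output are placeholders.

```python
import csv

# Header exactly as returned by rdd.first() in the notebook.
HEADER = ("instant,dteday,season,yr,mnth,holiday,weekday,workingday,"
          "weathersit,temp,atemp,hum,windspeed,casual,registered,cnt")

# Columns kept by from_csv.
WANTED = ["yr", "workingday", "weathersit", "temp", "hum", "windspeed", "casual"]

# Map each column name to its position in the header.
index_of = {name: i for i, name in enumerate(HEADER.split(","))}


def parse_line(line):
    """Return the wanted columns of one CSV line as a dict of floats."""
    cols = next(csv.reader([line]))
    return {name: float(cols[index_of[name]]) for name in WANTED}


# Hypothetical sample line: the selected columns match the first row shown
# in the notebook; the remaining fields are placeholders.
sample = "1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.0,0.805833,0.160446,331,0,0"
print(parse_line(sample))
```

Looking columns up by name keeps the parser correct if the column order changes, at the cost of keeping the header around.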

In [9]:
rows = rdd \
  .filter(lambda line: not line.startswith("instant")) \
  .map(lambda line: from_csv(line))

In [10]:
rows.take(5)

[Row(casual=331.0, hum=0.805833, temp=0.344167, weathersit=2.0, windspeed=0.160446, workingday=0.0, yr=0.0),
 Row(casual=131.0, hum=0.696087, temp=0.363478, weathersit=2.0, windspeed=0.248539, workingday=0.0, yr=0.0),
 Row(casual=120.0, hum=0.437273, temp=0.196364, weathersit=1.0, windspeed=0.248309, workingday=1.0, yr=0.0),
 Row(casual=108.0, hum=0.590435, temp=0.2, weathersit=1.0, windspeed=0.160296, workingday=1.0, yr=0.0),
 Row(casual=82.0, hum=0.436957, temp=0.226957, weathersit=1.0, windspeed=0.1869, workingday=1.0, yr=0.0)]

In [11]:
train_df = spark.createDataFrame(rows)
train_df.show()

+------+--------+--------+----------+---------+----------+---+
|casual|     hum|    temp|weathersit|windspeed|workingday| yr|
+------+--------+--------+----------+---------+----------+---+
| 331.0|0.805833|0.344167|       2.0| 0.160446|       0.0|0.0|
| 131.0|0.696087|0.363478|       2.0| 0.248539|       0.0|0.0|
| 120.0|0.437273|0.196364|       1.0| 0.248309|       1.0|0.0|
| 108.0|0.590435|     0.2|       1.0| 0.160296|       1.0|0.0|
|  82.0|0.436957|0.226957|       1.0|   0.1869|       1.0|0.0|
|  88.0|0.518261|0.204348|       1.0|0.0895652|       1.0|0.0|
| 148.0|0.498696|0.196522|       2.0| 0.168726|       1.0|0.0|
|  68.0|0.535833|   0.165|       2.0| 0.266804|       0.0|0.0|
|  54.0|0.434167|0.138333|       1.0|  0.36195|       0.0|0.0|
|  41.0|0.482917|0.150833|       1.0| 0.223267|       1.0|0.0|
|  43.0|0.686364|0.169091|       2.0| 0.122132|       1.0|0.0|
|  25.0|0.599545|0.172727|       1.0| 0.304627|       1.0|0.0|
|  38.0|0.470417|   0.165|       1.0|    0.301|       1

In [12]:
va = VectorAssembler(inputCols=['yr', 'workingday', 'weathersit', 'temp', 'hum', 'windspeed'], outputCol='features')
t = va.transform(train_df)
t.printSchema()
t.show()

root
 |-- casual: double (nullable = true)
 |-- hum: double (nullable = true)
 |-- temp: double (nullable = true)
 |-- weathersit: double (nullable = true)
 |-- windspeed: double (nullable = true)
 |-- workingday: double (nullable = true)
 |-- yr: double (nullable = true)
 |-- features: vector (nullable = true)

+------+--------+--------+----------+---------+----------+---+--------------------+
|casual|     hum|    temp|weathersit|windspeed|workingday| yr|            features|
+------+--------+--------+----------+---------+----------+---+--------------------+
| 331.0|0.805833|0.344167|       2.0| 0.160446|       0.0|0.0|[0.0,0.0,2.0,0.34...|
| 131.0|0.696087|0.363478|       2.0| 0.248539|       0.0|0.0|[0.0,0.0,2.0,0.36...|
| 120.0|0.437273|0.196364|       1.0| 0.248309|       1.0|0.0|[0.0,1.0,1.0,0.19...|
| 108.0|0.590435|     0.2|       1.0| 0.160296|       1.0|0.0|[0.0,1.0,1.0,0.2,...|
|  82.0|0.436957|0.226957|       1.0|   0.1869|       1.0|0.0|[0.0,1.0,1.0,0.22...|
|  88.0|0.5182
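
Conceptually, `VectorAssembler` concatenates the listed input columns, in the order given by `inputCols`, into a single vector per row. The plain-Python sketch below mimics that behaviour for one row; the function `assemble` is a hypothetical stand-in for illustration and is not how Spark implements it.

```python
def assemble(row, input_cols):
    """Concatenate the named numeric fields of `row` into one list, in order."""
    return [float(row[c]) for c in input_cols]


input_cols = ["yr", "workingday", "weathersit", "temp", "hum", "windspeed"]

# First row of train_df as shown in the notebook output.
row = {"casual": 331.0, "hum": 0.805833, "temp": 0.344167,
       "weathersit": 2.0, "windspeed": 0.160446, "workingday": 0.0, "yr": 0.0}

features = assemble(row, input_cols)
print(features)  # [0.0, 0.0, 2.0, 0.344167, 0.805833, 0.160446]
```

Note that `casual` is left out of the vector because it is the value being predicted, and that the order of the vector follows `input_cols` rather than the alphabetical column order of the DataFrame.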

In [13]:
# Linear regression expects the explanatory variables (features) as a column of type Vector (or SparseVector).
# 'features' is the default name, but a different column name can be passed to the model.
t.select(t.features).show()

+--------------------+
|            features|
+--------------------+
|[0.0,0.0,2.0,0.34...|
|[0.0,0.0,2.0,0.36...|
|[0.0,1.0,1.0,0.19...|
|[0.0,1.0,1.0,0.2,...|
|[0.0,1.0,1.0,0.22...|
|[0.0,1.0,1.0,0.20...|
|[0.0,1.0,2.0,0.19...|
|[0.0,0.0,2.0,0.16...|
|[0.0,0.0,1.0,0.13...|
|[0.0,1.0,1.0,0.15...|
|[0.0,1.0,2.0,0.16...|
|[0.0,1.0,1.0,0.17...|
|[0.0,1.0,1.0,0.16...|
|[0.0,1.0,1.0,0.16...|
|[0.0,0.0,2.0,0.23...|
|[0.0,0.0,1.0,0.23...|
|[0.0,0.0,2.0,0.17...|
|[0.0,1.0,2.0,0.21...|
|[0.0,1.0,2.0,0.29...|
|[0.0,1.0,2.0,0.26...|
+--------------------+
only showing top 20 rows
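
The difference between a dense and a sparse vector can be sketched in plain Python: a dense vector stores every entry, while a sparse one stores only the size, the indices of the non-zero entries, and their values. The helpers `to_sparse` and `to_dense` below are hypothetical names used only for this illustration.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list from its sparse description."""
    out = [0.0] * size
    for i, v in zip(indices, values):
        out[i] = v
    return out


dense = [0.0, 0.0, 2.0, 0.344167, 0.805833, 0.160446]
size, indices, values = to_sparse(dense)
print(size, indices, values)
# 6 [2, 3, 4, 5] [2.0, 0.344167, 0.805833, 0.160446]
```

The sparse form only pays off when most entries are zero, which is why the mostly non-zero feature vectors in this notebook are displayed as dense.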



In [14]:
# By default, linear regression expects the column of target values to be named 'label',
# so we can either rename the column or pass a different name.
t.select(t.casual).show()

+------+
|casual|
+------+
| 331.0|
| 131.0|
| 120.0|
| 108.0|
|  82.0|
|  88.0|
| 148.0|
|  68.0|
|  54.0|
|  41.0|
|  43.0|
|  25.0|
|  38.0|
|  54.0|
| 222.0|
| 251.0|
| 117.0|
|   9.0|
|  78.0|
|  83.0|
+------+
only showing top 20 rows



In [15]:
# create the estimator
lr = LinearRegression(maxIter=100, labelCol='casual')

In [16]:
# We could train the estimator directly with fit(),
# but it is better to use a pipeline that chains the transformers and estimators together.
p = Pipeline(stages=[va, lr])

In [17]:
# Train the model on the raw form of the data;
# the pipeline performs the necessary transformations for us.
lrmodel = p.fit(train_df)

In [18]:
# prediction
test_df = spark.createDataFrame([(0.0,0.0,2.0,0.344167,0.805833,0.160446)], va.getInputCols())
lrmodel.transform(test_df).show()

+---+----------+----------+--------+--------+---------+--------------------+-----------------+
| yr|workingday|weathersit|    temp|     hum|windspeed|            features|       prediction|
+---+----------+----------+--------+--------+---------+--------------------+-----------------+
|0.0|       0.0|       2.0|0.344167|0.805833| 0.160446|[0.0,0.0,2.0,0.34...|833.1910106215604|
+---+----------+----------+--------+--------+---------+--------------------+-----------------+
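
A fitted linear regression predicts by taking the intercept plus the dot product of the coefficients with the feature vector. The sketch below shows that arithmetic in plain Python. The coefficient and intercept values are hypothetical and are NOT the ones fitted in this notebook; the fitted ones should be readable from the last pipeline stage via `lrmodel.stages[-1].coefficients` and `lrmodel.stages[-1].intercept`.

```python
def predict(features, coefficients, intercept):
    """Linear model: intercept + sum of coefficient * feature."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))


# Hypothetical parameters, chosen only to illustrate the formula.
coefficients = [100.0, -300.0, -50.0, 1000.0, -200.0, -400.0]
intercept = 500.0

# Same feature order as in VectorAssembler:
# yr, workingday, weathersit, temp, hum, windspeed
features = [0.0, 0.0, 2.0, 0.344167, 0.805833, 0.160446]

print(predict(features, coefficients, intercept))
```

Because the formula is unbounded, nothing prevents a linear model from producing negative predictions for a count such as `casual`.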



In [19]:
# true values and predictions
y_ypred = lrmodel.transform(train_df).select('casual', 'prediction')
y_ypred.show()

+------+-------------------+
|casual|         prediction|
+------+-------------------+
| 331.0|  833.1910106215604|
| 131.0|  835.6598923815511|
| 120.0| -91.81121696909565|
| 108.0|  -65.9853277327943|
|  82.0| 29.782883222540818|
|  88.0|  41.62183094984994|
| 148.0|-144.20092633851618|
|  68.0|  480.2598743011703|
|  54.0|  480.7235216607155|
|  41.0|-180.79315993583612|
|  43.0|-236.51097964599182|
|  25.0| -263.8912487257679|
|  38.0|  -220.755093856416|
|  54.0| -91.28559111137804|
| 222.0|  740.3919892933468|
| 251.0|  816.7261771692931|
| 117.0|  571.3868919246689|
|   9.0|-237.92613869842228|
|  78.0| -90.49646963007433|
|  83.0| -53.71771177662811|
+------+-------------------+
only showing top 20 rows
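
Several of the predictions above are negative, although a rental count cannot be. One common remedy, related to the log-scale exercise at the end of this section, is to fit the model on `log(1 + casual)` and transform the predictions back afterwards. The plain-Python sketch below only illustrates the transform and its inverse; the helper names `to_log` and `from_log` are assumptions for this example. In Spark, the corresponding column functions are `func.log1p` and `func.expm1`.

```python
import math


def to_log(y):
    """log(1 + y): defined for y = 0 and compresses large counts."""
    return math.log1p(y)


def from_log(z):
    """Inverse of to_log; the result is always greater than -1."""
    return math.expm1(z)


# First five values of `casual` from the notebook output.
casual = [331.0, 131.0, 120.0, 108.0, 82.0]

logged = [to_log(y) for y in casual]
restored = [from_log(z) for z in logged]

for y, z, r in zip(casual, logged, restored):
    print(f"{y:7.1f} -> {z:.4f} -> {r:.1f}")
```

Whatever value the model predicts on the log scale, the back-transformed prediction cannot fall below -1, so large negative counts such as those in the table above can no longer occur.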



In [20]:
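# Compute R^2 by hand.
# Note: func.variance is the sample variance (n - 1 denominator), while
# residual_variance divides by n, so the two denominators differ slightly.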
df_r2 = y_ypred.select(
    func.mean('casual').alias('mean'), 
    func.variance('casual').alias('variance'), 
    (func.sum(func.pow(y_ypred.casual - y_ypred.prediction, 2))/func.count('casual')).alias('residual_variance')) 
df_r2.show()

+-----------------+-----------------+------------------+
|             mean|         variance| residual_variance|
+-----------------+-----------------+------------------+
|848.1764705882352|471450.4414182111|153652.89278390468|
+-----------------+-----------------+------------------+



In [22]:
r = df_r2.collect()[0]

R2 = 1 - r.residual_variance/r.variance

print("R^2 is: {:.3f}".format(R2))

R^2 is: 0.674
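
For reference, the coefficient of determination is defined as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The plain-Python sketch below computes it from the two sums directly, so both terms share the same normalization. The function name `r_squared` and the toy data are assumptions for illustration and are unrelated to the bike-sharing data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data, made up for this example.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

print("R^2 is: {:.3f}".format(r_squared(y_true, y_pred)))  # R^2 is: 0.980
```

A perfect model gives 1, and a model that always predicts the mean of the true values gives 0.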


### Exercise

* Use a logarithmic scale for `casual`.
* ★ Perform cross-validation (hint: `train_df.sample(False, 0.75)` or see `pyspark.ml.tuning.CrossValidator`).
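
As background for the cross-validation item, the idea of k-fold splitting can be sketched in plain Python: the rows are shuffled, divided into k folds, and each fold serves once as the test set while the remaining folds form the training set. The function `k_fold_indices` and its parameters are hypothetical names for this illustration; this is not a Spark solution to the exercise.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is in exactly one test fold."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


for train, test in k_fold_indices(n_rows=10, k=5):
    print(len(train), len(test))  # prints "8 2" five times
```

The model is then fitted on each training part and scored on the matching test part, and the scores are averaged; this gives a more honest estimate than scoring on the training data itself.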