![](imgs/kodolamaczlogo.png)

# Big Data Processing with Apache Spark

Notebook author: Jakub Nowacki.

## Machine Learning on Spark

Spark ML includes many machine learning algorithms that run well in a distributed fashion and scale, including:

* linear regression,
* logistic regression,
* random forest,
* k-means clustering.

The API used below (`pyspark.mllib`) works on RDDs and requires some manual data preparation: each record has to be converted into a `LabeledPoint`.

In [2]:
import findspark
findspark.init()

In [3]:
import pyspark

spark = pyspark.sql.SparkSession.builder \
    .appName("SparkMLlib") \
    .getOrCreate()
sc = spark.sparkContext

In [4]:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from operator import add

In [5]:
# Load the data from a CSV file.
# A detailed description of the data is available in the README file.
rdd = sc.textFile("data/Bike-Sharing-Dataset/day.csv")

In [6]:
rdd.first()

'instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt'

In [7]:
all_columns = rdd.first().split(",")
all_columns

['instant',
 'dteday',
 'season',
 'yr',
 'mnth',
 'holiday',
 'weekday',
 'workingday',
 'weathersit',
 'temp',
 'atemp',
 'hum',
 'windspeed',
 'casual',
 'registered',
 'cnt']

In [8]:
x_cols = ['yr', 'workingday', 'weathersit', 'temp', 'hum', 'windspeed']
y_col = 'casual'
x_inds = [i for i, col in enumerate(all_columns) if col in x_cols]
y_ind = all_columns.index(y_col)
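
The index lookup above can be checked on a single row without Spark. The following is a minimal sketch in plain Python: the header is the one printed by `rdd.first()`, while `sample_line` is an assumed example row used only for illustration and `parse_line` is a hypothetical helper that is not part of the notebook.

```python
# Header as printed by rdd.first() in the notebook.
header = ("instant,dteday,season,yr,mnth,holiday,weekday,workingday,"
          "weathersit,temp,atemp,hum,windspeed,casual,registered,cnt")
all_columns = header.split(",")

x_cols = ['yr', 'workingday', 'weathersit', 'temp', 'hum', 'windspeed']
y_col = 'casual'

# Positions of the feature columns, in the order they appear in the file.
x_inds = [i for i, col in enumerate(all_columns) if col in x_cols]
y_ind = all_columns.index(y_col)


def parse_line(line):
    """Hypothetical helper: split one CSV row into (label, features)."""
    values = line.split(",")
    return float(values[y_ind]), [float(values[i]) for i in x_inds]


# Assumed sample row, used only for illustration.
sample_line = ("1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,"
               "0.805833,0.160446,331,654,985")
label, features = parse_line(sample_line)
print(x_inds, y_ind)
print(label, features)
```

Note that `x_inds` follows the column order of the file, not the order in which the names are listed in `x_cols`.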

In [9]:
lps = rdd \
  .filter(lambda line: not line.startswith("instant")) \
  .map(lambda line: line.split(",")) \
  .map(lambda values: LabeledPoint(float(values[y_ind]), [float(values[i]) for i in x_inds])) \
  .cache()

In [10]:
lps.take(5)

[LabeledPoint(331.0, [0.0,0.0,2.0,0.344167,0.805833,0.160446]),
 LabeledPoint(131.0, [0.0,0.0,2.0,0.363478,0.696087,0.248539]),
 LabeledPoint(120.0, [0.0,1.0,1.0,0.196364,0.437273,0.248309]),
 LabeledPoint(108.0, [0.0,1.0,1.0,0.2,0.590435,0.160296]),
 LabeledPoint(82.0, [0.0,1.0,1.0,0.226957,0.436957,0.1869])]

In [11]:
# Let's see what a LabeledPoint contains
p = lps.first()

In [12]:
p.label

331.0

In [13]:
p.features

DenseVector([0.0, 0.0, 2.0, 0.3442, 0.8058, 0.1604])

In [14]:
# Create the regressor and train it on the data
lrspark = LinearRegressionWithSGD.train(lps, iterations=100, intercept=True)
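
`LinearRegressionWithSGD` fits the weights iteratively by gradient descent on the squared error. As a rough illustration of that idea only, here is a plain-Python sketch of full-batch gradient descent for least squares. It is not Spark's implementation; `fit_linear_gd`, its `step` parameter, and the toy data are assumptions made for this example.

```python
def fit_linear_gd(xs, ys, iterations=1000, step=0.5):
    """Fit y ~ w . x + b by full-batch gradient descent on half the mean squared error."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            # Residual of the current model on this record.
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(d):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        # Move against the gradient.
        w = [wi - step * gj for wi, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b


def predict(w, b, x):
    """Evaluate the fitted linear model on one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Toy data generated from y = 2*x + 1 (illustrative, not from the dataset).
xs = [[0.0], [0.25], [0.5], [0.75], [1.0]]
ys = [2 * x[0] + 1 for x in xs]
w, b = fit_linear_gd(xs, ys)
print(w, b)
```

With too few iterations or a badly chosen step size, such a procedure stops far from the least-squares solution, which is one reason to check the fit quality after training.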



In [15]:
# Prediction
lrspark.predict([0.0,0.0,2.0,0.344167,0.805833,0.160446])

912.51987276275258

In [16]:
# True labels and predictions
y_ypred = lps.map(lambda x: (x.label, lrspark.predict(x.features))).cache()

In [17]:
y_ypred.take(5)

[(331.0, 912.51987276275258),
 (131.0, 903.77283936573997),
 (120.0, 128.93655592371169),
 (108.0, 176.71036069759646),
 (82.0, 163.78715609278061)]

In [18]:
# Let's compute R^2 by hand

mean_y = y_ypred.map(lambda x: x[0]).mean()

variance_y = y_ypred.map(lambda x: x[0]).variance()

residual_variance_y = y_ypred \
  .map(lambda x: (x[0] - x[1])**2) \
  .reduce(add) / y_ypred.count()

R2 = 1 - residual_variance_y/variance_y

print("Mean: {}, Variance: {}, Residual Variance: {}".format(mean_y, variance_y, residual_variance_y))
print("R^2 is: {:.3f}".format(R2))

Mean: 848.1764705882356, Variance: 470805.5023738634, Residual Variance: 186216.51706338648
R^2 is: 0.604
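
The quantity computed above is the coefficient of determination: one minus the mean squared residual divided by the variance of the true labels. A small Spark-free sketch of the same calculation follows; `r_squared` is a hypothetical helper and the toy values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - mean squared residual / population variance of y_true."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    variance_y = sum((y - mean_y) ** 2 for y in y_true) / n
    residual_variance = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
    return 1 - residual_variance / variance_y


# Toy values, not taken from the dataset.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print("R^2 is: {:.3f}".format(r_squared(y_true, y_pred)))

# Plugging in the two variance figures printed by the notebook.
notebook_r2 = 1 - 186216.51706338648 / 470805.5023738634
print("R^2 is: {:.3f}".format(notebook_r2))
```

Both terms divide by the number of records, matching `RDD.variance()`, which returns the population variance. A model that always predicts the mean gets R^2 = 0, and a perfect model gets R^2 = 1.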


### Exercise

* Use a logarithmic scale for `casual`.
* ★ Perform cross-validation (hint: `lps.randomSplit([0.75, 0.25])`).
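
The hint points to a single random train/test split. To illustrate what such a split does, here is a Spark-free sketch in which each record is assigned independently to one part according to the given weights, so the part sizes are only approximately 75% and 25%. `random_split` and the seed are assumptions for this example, not the Spark API.

```python
import random


def random_split(items, weights, seed=42):
    """Assign each item independently to one part, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    parts = [[] for _ in weights]
    for item in items:
        r = rng.random() * total
        cumulative = 0.0
        for part, weight in zip(parts, weights):
            cumulative += weight
            if r < cumulative:
                part.append(item)
                break
        else:
            parts[-1].append(item)  # guard against floating-point round-off
    return parts


records = list(range(1000))  # stand-in for the labeled points
train, test = random_split(records, [0.75, 0.25])
print(len(train), len(test))
```

The model would then be trained on `train` only and R^2 computed on `test`; repeating this over several different splits gives a cross-validated estimate.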

In [19]:
import math

In [20]:
lps = rdd \
  .filter(lambda line: not line.startswith("instant")) \
  .map(lambda line: line.split(",")) \
  .map(lambda values: LabeledPoint(math.log10(float(values[y_ind])), [float(values[i]) for i in x_inds])) \
  .cache()

In [21]:
lps.take(5)

[LabeledPoint(2.519827993775719, [0.0,0.0,2.0,0.344167,0.805833,0.160446]),
 LabeledPoint(2.1172712956557644, [0.0,0.0,2.0,0.363478,0.696087,0.248539]),
 LabeledPoint(2.0791812460476247, [0.0,1.0,1.0,0.196364,0.437273,0.248309]),
 LabeledPoint(2.03342375548695, [0.0,1.0,1.0,0.2,0.590435,0.160296]),
 LabeledPoint(1.9138138523837167, [0.0,1.0,1.0,0.226957,0.436957,0.1869])]

In [22]:
p = lps.first()

In [23]:
lrspark = LinearRegressionWithSGD.train(lps, iterations=100, intercept=True)



In [24]:
lrspark.predict([0.0,0.0,2.0,0.344167,0.805833,0.160446])

2.6218774814630832

In [29]:
y_ypred = lps.map(lambda x: (x.label, lrspark.predict(x.features))).cache()
[(10**p[0], 10**p[1]) for p in y_ypred.take(5)]

[(331.0000000000001, 418.67543619808464),
 (131.00000000000006, 402.72396349808065),
 (119.99999999999996, 154.7767938411198),
 (108.00000000000004, 178.52028580106682),
 (82.0, 164.12663209690226)]
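
Because the model is trained on the base-10 logarithm of `casual`, its outputs have to be raised to the power of 10 to get back to ride counts, as the last cell does. A minimal sketch of that round trip, using the first label and prediction printed above (`to_counts` is a hypothetical helper):

```python
import math


def to_counts(log_value):
    """Invert the log10 transform applied to the label."""
    return 10 ** log_value


label = math.log10(331.0)        # label of the first record on the log scale
prediction = 2.6218774814630832  # model output for the same record, from the notebook

print(to_counts(label))       # close to 331.0, up to floating-point error
print(to_counts(prediction))  # prediction expressed as a ride count
```

Errors measured on the log scale are relative rather than absolute, so an R^2 computed on the transformed labels is not directly comparable with the one computed on raw counts.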