<h1 align = "center"> Spark RDD </h1>

Verify that the Spark context has been initialized.

In [1]:
sc.version

2.4.3

## Creating an RDD

Create an RDD by parallelizing a collection.

In [40]:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.collect().foreach(println)

1
2
3
4
5


data = Array(1, 2, 3, 4, 5)
distData = ParallelCollectionRDD[25] at parallelize at <console>:31


ParallelCollectionRDD[25] at parallelize at <console>:31
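
`parallelize` also accepts an optional second argument, `numSlices`, that requests how many partitions the collection is split into. The sketch below is a minimal illustration under assumptions: the name `ctx`, the app name `rdd-partitions-sketch` and the `local[*]` master are not part of the original notebook; `SparkContext.getOrCreate` returns the notebook's existing context when one is already running.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the running context, or start a local one for this sketch.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-partitions-sketch").setMaster("local[*]"))

val data = Array(1, 2, 3, 4, 5)

// Ask for two partitions explicitly instead of relying on the default.
val twoParts = ctx.parallelize(data, 2)
println(twoParts.getNumPartitions)

// glom() turns each partition into an array, which makes the split visible.
twoParts.glom().collect().foreach(p => println(p.mkString("[", ", ", "]")))
```

Without the second argument Spark falls back to its default parallelism, which depends on the cluster configuration.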

Create an RDD from a text file using the Spark context.

In [15]:
val logFile = sc.textFile("/resources/Prezentace/LabData/notebook.log")
logFile.take(10).foreach(println)

[I 12:09:13.491 NotebookApp] Using MathJax: /static/vendor/MathJax-2.5-latest/MathJax.js
[I 12:09:13.494 NotebookApp] Using existing profile dir: u'/home/notebook/.ipython/profile_default'
[I 12:09:13.513 NotebookApp] Writing notebook server cookie secret to /home/notebook/.ipython/profile_default/security/notebook_cookie_secret
[I 12:09:13.586 NotebookApp] Serving notebooks from local directory: /resources
[I 12:09:13.586 NotebookApp] 0 active kernels 
[I 12:09:13.586 NotebookApp] The IPython Notebook is running at: http://[all ip addresses on your system]:8888/
[I 12:09:13.586 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 12:09:13.586 NotebookApp] No web browser found: could not locate runnable browser.


logFile = /resources/Prezentace/LabData/notebook.log MapPartitionsRDD[19] at textFile at <console>:29


/resources/Prezentace/LabData/notebook.log MapPartitionsRDD[19] at textFile at <console>:29

Create a new RDD by transforming an existing one.

In [16]:
val error = logFile.filter(line => line.contains("ERROR"))
error.take(10).foreach(println)

       # Filter out the lines that contains INFO (or ERROR, if the particular log has it)
15/10/22 06:23:09 [ERROR] o.a.s.s.s.ReceiverTracker - Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Socket data stream had no more data
15/10/22 06:23:11 [ERROR] o.a.s.s.s.ReceiverTracker - Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Error connecting to localhost:7777 - java.net.ConnectException: Connection refused
15/10/22 06:23:13 [ERROR] o.a.s.s.s.ReceiverTracker - Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Error connecting to localhost:7777 - java.net.ConnectException: Connection refused
15/10/22 06:23:15 [ERROR] o.a.s.s.s.ReceiverTracker - Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Error connecting to localhost:7777 - java.net.ConnectException: Connection refused
15/10/22 06:23:17 [ERROR] o.a.s.s.s.ReceiverTracker - Deregistered receiver for stream 0: Restarting receiver

error = MapPartitionsRDD[20] at filter at <console>:27


MapPartitionsRDD[20] at filter at <console>:27

In [18]:
val timesThree = distData.map(x => x * 3)
timesThree.collect().foreach(println)

3
6
9
12
15


timesThree = MapPartitionsRDD[22] at map at <console>:29


MapPartitionsRDD[22] at map at <console>:29
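
Transformations such as `filter` and `map` are lazy: they only record how the new RDD is derived from its parent, and no job runs until an action is called. The following sketch chains two transformations and then triggers them with `collect()`. The names `ctx`, `numbers` and `evenTimesThree`, the app name and the `local[*]` master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the running context, or start a local one for this sketch.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-lazy-sketch").setMaster("local[*]"))

val numbers = ctx.parallelize(1 to 5)

// Nothing is computed here; Spark only records the lineage.
val evenTimesThree = numbers.filter(_ % 2 == 0).map(_ * 3)

// toDebugString shows the recorded chain of parent RDDs.
println(evenTimesThree.toDebugString)

// collect() is an action, so the filter and map run now.
val result = evenTimesThree.collect()
result.foreach(println)
```

If the same transformed RDD feeds several actions, calling `cache()` on it before the first action avoids recomputing the chain each time.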

## RDD actions

Count the number of lines in the RDD.

In [29]:
logFile.count()

34836

Maximum value in the RDD.

In [31]:
distData.max()

5

Minimum value in the RDD.

In [33]:
distData.min()

1

Print summary statistics for the RDD.

In [34]:
distData.stats()

(count: 5, mean: 3.000000, stdev: 1.414214, max: 5.000000, min: 1.000000)
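
The `stdev` reported above is the population standard deviation (dividing by n); the `StatCounter` returned by `stats()` also exposes `sampleStdev`, which divides by n - 1. The plain Scala sketch below reproduces both values for the same five numbers without Spark; the variable names are illustrative only.

```scala
// Same values as distData, as doubles.
val values = Seq(1, 2, 3, 4, 5).map(_.toDouble)

val mean = values.sum / values.size

// Sum of squared deviations from the mean.
val squaredDeviations = values.map(v => math.pow(v - mean, 2)).sum

// Population standard deviation: divide by n.
val populationStdev = math.sqrt(squaredDeviations / values.size)

// Sample standard deviation: divide by n - 1.
val sampleStdev = math.sqrt(squaredDeviations / (values.size - 1))

println(s"mean=$mean populationStdev=$populationStdev sampleStdev=$sampleStdev")
```

The population value is approximately 1.414214, matching the `stats()` output, while the sample value is approximately 1.581139.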

Return all elements of the RDD to the driver as a collection.

In [37]:
distData.collect()

Array(1, 2, 3, 4, 5)

Return a given number of elements of the RDD as a collection.

In [38]:
logFile.take(10)
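
Unlike `collect()`, which brings the whole RDD to the driver, `take(n)` fetches only the first n elements. Two related actions, `takeOrdered` and `top`, return the n smallest or n largest elements instead. A minimal sketch, assuming a small unsorted collection; `ctx`, `nums`, the app name and the `local[*]` master are hypothetical and not part of the original notebook.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the running context, or start a local one for this sketch.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-take-sketch").setMaster("local[*]"))

val nums = ctx.parallelize(Array(3, 1, 5, 2, 4))

// First two elements in partition order, not sorted.
val firstTwo = nums.take(2)

// Two smallest elements, in ascending order.
val smallestTwo = nums.takeOrdered(2)

// Two largest elements, in descending order.
val largestTwo = nums.top(2)

println(firstTwo.mkString(", "))
println(smallestTwo.mkString(", "))
println(largestTwo.mkString(", "))
```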



Sum the elements of the RDD.

In [42]:
distData.reduce((x, y) => x + y)

15

In [43]:
distData.sum()

15.0
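
The two results differ in type: `reduce` keeps the element type (`Int`), while `sum()` goes through Spark's numeric RDD functions and returns a `Double`. The sketch below contrasts them and adds `fold`, which takes an explicit zero value and therefore also works on an empty RDD, where `reduce` throws an exception. The names `ctx`, `nums` and `empty`, the app name and the `local[*]` master are assumptions for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the running context, or start a local one for this sketch.
val ctx = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sum-sketch").setMaster("local[*]"))

val nums = ctx.parallelize(Array(1, 2, 3, 4, 5))

// reduce keeps the element type.
val viaReduce: Int = nums.reduce(_ + _)

// fold is like reduce, but starts from an explicit zero value.
val viaFold: Int = nums.fold(0)(_ + _)

// sum() always returns a Double.
val viaSum: Double = nums.sum()

// On an empty RDD fold returns the zero value; reduce would throw.
val empty = ctx.parallelize(Seq.empty[Int])
val emptyFold: Int = empty.fold(0)(_ + _)

println(s"reduce=$viaReduce fold=$viaFold sum=$viaSum emptyFold=$emptyFold")
```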

## Key-value pair RDDs

Create a new RDD from the file nyctaxi.csv.

In [46]:
val taxi = sc.textFile("/resources/Prezentace/LabData/nyctaxi.csv")

taxi = /resources/Prezentace/LabData/nyctaxi.csv MapPartitionsRDD[28] at textFile at <console>:27


/resources/Prezentace/LabData/nyctaxi.csv MapPartitionsRDD[28] at textFile at <console>:27

Display the first five lines.

In [47]:
taxi.take(5).foreach(println)

"_id","_rev","dropoff_datetime","dropoff_latitude","dropoff_longitude","hack_license","medallion","passenger_count","pickup_datetime","pickup_latitude","pickup_longitude","rate_code","store_and_fwd_flag","trip_distance","trip_time_in_secs","vendor_id"
"29b3f4a30dea6688d4c289c9672cb996","1-ddfdec8050c7ef4dc694eeeda6c4625e","2013-01-11 22:03:00",+4.07033460000000E+001,-7.40144200000000E+001,"A93D1F7F8998FFB75EEF477EB6077516","68BC16A99E915E44ADA7E639B4DD5F59",2,"2013-01-11 21:48:00",+4.06760670000000E+001,-7.39810790000000E+001,1,,+4.08000000000000E+000,900,"VTS"
"2a80cfaa425dcec0861e02ae44354500","1-b72234b58a7b0018a1ec5d2ea0797e32","2013-01-11 04:28:00",+4.08190960000000E+001,-7.39467470000000E+001,"64CE1B03FDE343BB8DFB512123A525A4","60150AA39B2F654ED6F0C3AF8174A48A",1,"2013-01-11 04:07:00",+4.07280540000000E+001,-7.40020370000000E+001,1,,+8.53000000000000E+000,1260,"VTS"
"29b3f4a30dea6688d4c289c96758d87e","1-387ec30eac5abda89d2abefdf947b2c1","2013-01-11 22:02:00",+4.07277180000000E+00