# Spark RDD Training

## Initializing Spark
Working with Spark starts with initializing a Spark session.
To initialize Spark, we import the `findspark` package and call its `init()` function:

In [22]:
import findspark
findspark.init() 

Next we create the Spark session object. Note that the Spark application name is set here:

In [1]:
from pyspark.sql import SparkSession

In [2]:
# the session object is conventionally named "spark"
spark = SparkSession.builder.appName("DataScience").getOrCreate() 
spark

24/04/27 11:54:54 WARN Utils: Your hostname, Pablos-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.0.94 instead (on interface en0)
24/04/27 11:54:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/27 11:54:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/27 11:54:55 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


With the session object in hand, we can obtain the Spark context from it. The context gives direct access to RDD collections and to Spark's core services.

In [3]:
sc = spark.sparkContext
sc

## Exercise: Counting words in a text
Using RDDs, we will count the total number of words in a text and the number of occurrences of each individual word.

To do this we will:
- load a text file into an RDD
- split the text into words
- count the words.
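
The three steps above can be sketched in plain Python before involving Spark. This is only an illustration on a hypothetical in-memory sample (`sample_lines` is made up here and is not the contents of the book); the RDD version follows the same read, split, count shape.

```python
from collections import Counter

# Hypothetical sample standing in for the lines of a text file.
sample_lines = [
    "the quick brown fox",
    "",
    "jumps over the lazy dog",
]

# Split every line into words (empty lines contribute nothing).
words = [word for line in sample_lines for word in line.split()]

# Total number of words, then occurrences of each word.
total_words = len(words)
word_counts = Counter(words)

print(total_words)
print(word_counts.most_common(1))
```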

Then we will look at which words occur most often in the text.

We will also apply a filtering transformation to drop uninformative words such as conjunctions and prepositions. These are defined in the file `data/stopwords.txt`.
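
A minimal sketch of how such a stop-word file could be loaded and applied. The one-word-per-line layout is an assumption (the format of `data/stopwords.txt` is not shown here), and `parse_stop_words` and `is_content_word` are hypothetical helper names.

```python
def parse_stop_words(lines):
    """Build a set of lower-cased stop words from an iterable of lines.

    Assumes one word per line; blank lines are skipped.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def is_content_word(word, stop_words):
    """Return True when the word is not a stop word (case-insensitive)."""
    return word.lower() not in stop_words


# In the notebook this could be fed with open("data/stopwords.txt");
# a small in-memory sample is used here so the sketch runs anywhere.
stop_words_sample = parse_stop_words(["the\n", "of\n", "\n", "And\n"])
kept = [w for w in ["The", "Project", "of", "Ulysses"]
        if is_content_word(w, stop_words_sample)]
print(kept)
```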


## Loading the data

The exercise analyses an English-language text stored in the file `data/books/ulysses.txt`.

The first step is to load the text file into an RDD. For this we use the `textFile()` method, which returns an RDD with one element per line of the file.


In [5]:
text_rdd = sc.textFile("../../data/books/ulysses.txt")

In [6]:
text_rdd

../../data/books/ulysses.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [10]:
text_rdd.getNumPartitions()

2

In [11]:
filter_rdd = text_rdd.filter(lambda line: "Ulysses" in line)

In [12]:
lines = text_rdd.collect()

In [13]:
len(lines)

33216

In [15]:
filter_rdd.collect()

['The Project Gutenberg eBook of Ulysses, by James Joyce',
 'Title: Ulysses',
 'Ulysses',
 'man, shipwrecked in storms dire, Tried, like another Ulysses, Pericles,',
 'makes Ulysses quote Aristotle.',
 'in Spain, and Ulysses Browne of Camus that was fieldmarshal to Maria',
 'general Ulysses Grant whoever he was or did supposed to be some great']

In [16]:
text_rdd.take(20)

['The Project Gutenberg eBook of Ulysses, by James Joyce',
 '',
 'This eBook is for the use of anyone anywhere in the United States and',
 'most other parts of the world at no cost and with almost no restrictions',
 'whatsoever. You may copy it, give it away or re-use it under the terms',
 'of the Project Gutenberg License included with this eBook or online at',
 'www.gutenberg.org. If you are not located in the United States, you',
 'will have to check the laws of the country where you are located before',
 'using this eBook.',
 '',
 'Title: Ulysses',
 '',
 'Author: James Joyce',
 '',
 'Release Date: December 27, 2001 [eBook #4300]',
 '[Most recently updated: December 27, 2019]',
 '',
 'Language: English',
 '',
 'Character set encoding: UTF-8']

In [18]:
text_rdd.top(5)

['•',
 '“YOU CAN DO IT!”',
 '’Tis, sure. What say? In the speakeasy. Tight. I shee you, shir.',
 '’Tis the last rose of summer dollard left bloom felt wind wound round',
 '—’lldo! cried Father Cowley.']
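
`top(5)` returns the five largest elements in descending order, and Python strings compare by Unicode code point. That is why lines starting with typographic characters such as `•` or `“` come first. The same ordering can be reproduced in plain Python with `heapq.nlargest`; the list below is a hypothetical excerpt, not the whole file.

```python
import heapq

# Hypothetical excerpt of lines; the leading characters decide the order.
sample = [
    "The Project Gutenberg eBook of Ulysses, by James Joyce",
    "—’lldo! cried Father Cowley.",
    "“YOU CAN DO IT!”",
    "•",
    "Title: Ulysses",
]

# Largest first, compared character by character by code point.
largest = heapq.nlargest(3, sample)
print(largest)

# Code points of the leading characters, for comparison.
print([hex(ord(line[0])) for line in largest])
```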

In [19]:
x = [1,2,3,4,5]

In [20]:
rdd = sc.parallelize(x)

In [22]:
rdd.collect()

[1, 2, 3, 4, 5]

In [23]:
rdd.first()

1

In [24]:
rdd.take(3)

[1, 2, 3]

In [25]:
rdd2 = rdd.map(lambda x: x*x)

In [26]:
rdd2.collect()

[1, 4, 9, 16, 25]

In [27]:
x = list(range(1000))


In [40]:
rdd = sc.parallelize(x)

In [50]:
rdd3 = rdd.map(lambda x: x*x*x)

In [51]:
rdd3.count()

1000

In [52]:
rdd3.take(5)

[0, 1, 8, 27, 64]

In [53]:
nieparzyste_rdd3 = rdd3.filter(lambda x: x%2 == 1)

In [54]:
nieparzyste_rdd3.count()

500

In [55]:
parzyste_rdd3 = rdd3.filter(lambda x: x%2 == 0)

In [56]:
parzyste_rdd3.count()

500

In [60]:
def filter_nieparzystne(x: int) -> bool:
    return x % 2 == 1

In [59]:
filter_nieparzystne(5)

True

In [61]:
def power(x: int) -> int:
    return x * x

In [62]:
rdd.take(5)

[0, 1, 2, 3, 4]

In [63]:
rdd_niep = rdd.map(power).filter(filter_nieparzystne)

In [65]:
def greather_than(x, y):
    return x > y

In [71]:
wartosc_progu = 1000
rdd_progowe = rdd.map(power).filter(filter_nieparzystne).filter(lambda x: greather_than(x, wartosc_progu))

In [72]:
rdd_progowe.count()

484
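
The count of 484 can be cross-checked without Spark: the chain squares each number, keeps the odd squares, and then keeps those above the threshold. A plain-Python sketch of the same pipeline follows; the list names are used only here.

```python
threshold = 1000  # same value as wartosc_progu above

# map(power): square every number from 0 to 999
squares = [x * x for x in range(1000)]

# filter(filter_nieparzystne): keep odd squares only
odd_squares = [s for s in squares if s % 2 == 1]

# filter(greather_than): keep squares above the threshold
above_threshold = [s for s in odd_squares if s > threshold]

print(len(above_threshold))  # 484
```

The smallest surviving value is 33 * 33 = 1089, because 32 * 32 = 1024 is even and is removed by the first filter.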

In [73]:
text_rdd.take(5)

['The Project Gutenberg eBook of Ulysses, by James Joyce',
 '',
 'This eBook is for the use of anyone anywhere in the United States and',
 'most other parts of the world at no cost and with almost no restrictions',
 'whatsoever. You may copy it, give it away or re-use it under the terms']

In [75]:
def pusta_linia(linia: str) -> bool:
    return len(linia) == 0

In [76]:
rdd_puste_linie = text_rdd.filter(pusta_linia)

In [77]:
rdd_puste_linie.count()

7419

In [78]:
text_rdd.count()

33216

In [79]:
def niepusta_linia(linia: str) -> bool:
    return len(linia) > 0

In [80]:
rdd_niepuste_linie = text_rdd.filter(niepusta_linia)

In [81]:
rdd_niepuste_linie.count()

25797

In [82]:
7419 + 25797

33216

In [86]:
def tablica_slow_w_linii(linia: str) -> list[str]:
    return linia.split()


In [87]:
tablica_slow_w_linii("hello world")

['hello', 'world']

In [91]:
rdd_tablice_slow = rdd_niepuste_linie.map(tablica_slow_w_linii)

In [92]:
rdd_tablice_slow.take(1)

[['The',
  'Project',
  'Gutenberg',
  'eBook',
  'of',
  'Ulysses,',
  'by',
  'James',
  'Joyce']]

In [96]:
rdd_flat = rdd_niepuste_linie.flatMap(tablica_slow_w_linii)

In [97]:
rdd_flat.take(5)

['The', 'Project', 'Gutenberg', 'eBook', 'of']
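
The difference between `map` and `flatMap` seen above can be modelled in plain Python: `map` yields one result per input element (here, one list per line), while `flatMap` concatenates those per-element results into a single sequence. A sketch on two made-up lines:

```python
from itertools import chain

lines = ["The Project Gutenberg", "eBook of Ulysses"]

# Like rdd.map(split): one list of words per input line.
mapped = [line.split() for line in lines]

# Like rdd.flatMap(split): the per-line lists are concatenated.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)
print(flat_mapped)
```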

In [98]:
rdd_flat.take(20)

['The',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Ulysses,',
 'by',
 'James',
 'Joyce',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'in',
 'the']

In [101]:
stop_words = {"is", "for", "the", "of", "by"}

In [105]:
def filter_stop_words(word: str) -> bool:
    return word not in stop_words

In [106]:
rdd_slowa = rdd_flat.filter(filter_stop_words)

In [107]:
rdd_slowa.take(5)

['The', 'Project', 'Gutenberg', 'eBook', 'Ulysses,']
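
Note that `'The'` survives the filter: the set holds lower-case `"the"` and the membership test is case-sensitive. Tokens also keep trailing punctuation (`'Ulysses,'`). One possible normalization step is sketched below; `normalize_word` is a hypothetical helper, and `string.punctuation` covers ASCII punctuation only, so typographic quotes and dashes would need extra handling.

```python
import string

stop_words = {"is", "for", "the", "of", "by"}


def normalize_word(word: str) -> str:
    """Strip leading/trailing ASCII punctuation and lower-case the word."""
    return word.strip(string.punctuation).lower()


tokens = ["The", "Project", "Gutenberg", "eBook", "of", "Ulysses,", "by"]
normalized = [normalize_word(t) for t in tokens]

# Drop empty strings (tokens that were pure punctuation) and stop words.
kept = [w for w in normalized if w and w not in stop_words]
print(kept)
```

On the RDD the same idea would be a `map(normalize_word)` placed before the `filter` step.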

In [109]:
def mapuj_slowo_na_pare(word: str) -> tuple[str, int]:
    return word, 1

In [110]:
mapuj_slowo_na_pare("slowo")

('slowo', 1)

In [111]:
rdd_pary_slow = rdd_slowa.map(mapuj_slowo_na_pare)

In [112]:
rdd_pary_slow.take(5)

[('The', 1), ('Project', 1), ('Gutenberg', 1), ('eBook', 1), ('Ulysses,', 1)]

In [115]:
rdd_liczniki_slow = rdd_pary_slow.reduceByKey(lambda a, b: a + b)

In [116]:
rdd_liczniki_slow.take(10)

[('Gutenberg', 22),
 ('eBook', 6),
 ('Ulysses,', 2),
 ('by', 1177),
 ('James', 28),
 ('use', 42),
 ('anyone', 17),
 ('United', 19),
 ('States', 7),
 ('and', 6543)]
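
`reduceByKey` merges all values that share a key using the given two-argument function. A plain-Python model of that behaviour is sketched below. It shows the semantics only: Spark also combines values within each partition before shuffling, which this model ignores. `reduce_by_key` is a hypothetical name.

```python
from typing import Callable, Hashable, Iterable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def reduce_by_key(pairs: Iterable[tuple[K, V]],
                  func: Callable[[V, V], V]) -> dict[K, V]:
    """Fold the values of each key with func, like RDD.reduceByKey."""
    result: dict[K, V] = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result


pairs = [("and", 1), ("eBook", 1), ("and", 1), ("and", 1)]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```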

In [117]:
rdd_liczniki_slow.count()

49913

In [118]:
rdd_slowa_posortowane = rdd_liczniki_slow.sortBy(lambda word_pair: word_pair[0].lower())

In [120]:
rdd_slowa_posortowane.take(10)

[('"Defects,"', 1),
 ('"Information', 1),
 ('"Plain', 2),
 ('"Project', 5),
 ('"Right', 1),
 ('#4300]', 1),
 ('$5,000)', 1),
 ('%', 4),
 ('&c,', 2),
 ('&c.', 1)]

In [121]:
def posortuj_slowa(word_pair: tuple[str, int]) -> int:
    return word_pair[1]

In [122]:
rdd_posortowane = rdd_liczniki_slow.sortBy(posortuj_slowa, ascending=True)

In [124]:
rdd_posortowane.take(5)

[('Author:', 1), ('[eBook', 1), ('#4300]', 1), ('Produced', 1), ('Col', 1)]

In [125]:
rdd_posortowane = rdd_liczniki_slow.sortBy(posortuj_slowa, ascending=False)

In [126]:
rdd_posortowane.take(10)

[('and', 6543),
 ('a', 5839),
 ('to', 4788),
 ('in', 4612),
 ('his', 3034),
 ('he', 2712),
 ('I', 2429),
 ('with', 2391),
 ('that', 2167),
 ('was', 2006)]
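
When only the most frequent words are needed, a full sort is not strictly required. In plain Python, "sort descending by count, take N" can be written with `heapq.nlargest`; the pairs below are copied from the output above. The RDD API offers `takeOrdered` with a `key` argument for a similar purpose.

```python
import heapq

word_counts = [("his", 3034), ("and", 6543), ("to", 4788),
               ("a", 5839), ("in", 4612)]

# Three most frequent words, largest count first.
top3 = heapq.nlargest(3, word_counts, key=lambda pair: pair[1])
print(top3)
```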

In [129]:
from src.stop_words import stop_words

In [137]:
rdd_posortowane.saveAsTextFile("data/words.txt")

In [138]:
!ls -l data/words.txt

total 1528
-rw-r--r--  1 katana  staff       0 Apr 14 14:37 _SUCCESS
-rw-r--r--  1 katana  staff  256506 Apr 14 14:37 part-00000
-rw-r--r--  1 katana  staff  523517 Apr 14 14:37 part-00001


In [139]:
!cat data/words.txt/part-00000 | head

('and', 6543)
('a', 5839)
('to', 4788)
('in', 4612)
('his', 3034)
('he', 2712)
('I', 2429)
('with', 2391)
('that', 2167)
('was', 2006)
cat: stdout: Broken pipe


In [140]:
!cat data/words.txt/part-00001 | head

('Author:', 1)
('[eBook', 1)
('#4300]', 1)
('Produced', 1)
('Col', 1)
('Choat', 1)
('Widger', 1)
('START', 1)
('stairhead,', 1)
('crossed.', 1)
cat: stdout: Broken pipe


In [141]:
slowa_bez_powtorzen = rdd_slowa.distinct()

In [142]:
slowa_bez_powtorzen.take(5)

['Gutenberg', 'eBook', 'Ulysses,', 'James', 'use']

In [144]:
word_pairs = sc.textFile("data/words.txt")

In [145]:
word_pairs.take(5)

["('Author:', 1)",
 "('[eBook', 1)",
 "('#4300]', 1)",
 "('Produced', 1)",
 "('Col', 1)"]

In [146]:
x = [(1,2),(3,4)]
str(x)

'[(1, 2), (3, 4)]'

In [147]:
y = [str(xx) for xx in x]
y

['(1, 2)', '(3, 4)']
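
`saveAsTextFile` wrote the `str()` of each tuple, so reading the directory back yields plain strings rather than pairs. One way to recover the tuples is `ast.literal_eval`, which parses Python literals without executing code. A sketch on lines copied from the output above (`parse_pair` is a hypothetical helper):

```python
import ast


def parse_pair(line: str) -> tuple[str, int]:
    """Turn a line such as "('Col', 1)" back into a (word, count) tuple."""
    word, count = ast.literal_eval(line)
    return word, count


lines = ["('Author:', 1)", "('[eBook', 1)", "('#4300]', 1)"]
pairs = [parse_pair(line) for line in lines]
print(pairs)
```

With the RDD read back from disk this would be applied as `word_pairs.map(parse_pair)`.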

In [162]:
def to_csv(word_pair: tuple[str, int]) -> str:
    return f"\"{word_pair[0]}\",{word_pair[1]}"
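
The f-string version does not escape quotation marks inside a word, and the word list contains tokens such as `"Defects,"` that include both a quote and a comma. A sketch of a more defensive variant using the standard `csv` module, which doubles embedded quotes; `to_csv_row` is a hypothetical name, not part of the notebook.

```python
import csv
import io


def to_csv_row(word_pair: tuple[str, int]) -> str:
    """Format a (word, count) pair as one CSV line with proper quoting."""
    buffer = io.StringIO()
    # QUOTE_NONNUMERIC quotes the word and leaves the integer count bare.
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(word_pair)
    # Drop the line terminator; saveAsTextFile adds its own newline.
    return buffer.getvalue().rstrip("\r\n")


print(to_csv_row(("and", 6543)))
print(to_csv_row(('"Defects,"', 1)))

# Round trip: the csv reader recovers the original word.
row = next(csv.reader([to_csv_row(('"Defects,"', 1))]))
print(row)
```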


In [156]:
rdd_posortowane.map(to_csv).saveAsTextFile("data/words.csv")

In [157]:
!ls -l data/words.csv

total 1048
-rw-r--r--  1 katana  staff       0 Apr 14 15:24 _SUCCESS
-rw-r--r--  1 katana  staff  169891 Apr 14 15:24 part-00000
-rw-r--r--  1 katana  staff  360567 Apr 14 15:24 part-00001


In [158]:
!cat data/words.csv/part-00000

and,6543
a,5839
to,4788
in,4612
his,3034
he,2712
I,2429
with,2391
that,2167
was,2006
on,1894
it,1680
her,1505
you,1368
at,1215
by,1177
him,1112
as,1099
all,1041
The,1035
from,1012
or,939
He,908
be,823
they,768
she,768
had,766
out,760
not,744
my,708
Mr,699
their,678
up,661
like,649
me,640
have,620
an,591
A,560
one,499
them,497
And,497
about,493
when,482
said.,480
were,473
are,467
what,458
which,457
your,451
says,450
so,450
if,438
there,428
Bloom,428
but,424
said,423
old,419
over,392
this,379
down,367
no,363
would,349
then,347
after,345
who,342
into,324
Stephen,314
did,305
What,304
two,304
its,299
do,299
off,299
will,289
those,285
some,283
see,282
could,280
we,280
BLOOM:,277
other,271
man,270
said,,265
little,263
has,262
She,255
back,251
too,246
more,238
our,237
His,237
it.,237
You,235
time,233
through,231
know,230
good,225
get,224
THE,222
But,217
eyes,217
round,214
They,211
only,211
now,209
under,208
long,206
where,204
any,204
never,202
put,198
way,197
very,196
hand,195
It,194
_(He,192


In [163]:
!rm -r data/word_pairs.csv

In [164]:
rdd_posortowane.map(to_csv).saveAsTextFile("data/word_pairs.csv")

In [165]:
!cat data/word_pairs.csv/part-00000 | head

"and",6543
"a",5839
"to",4788
"in",4612
"his",3034
"he",2712
"I",2429
"with",2391
"that",2167
"was",2006
cat: stdout: Broken pipe


In [166]:
word_pairs_text = spark.read.text("data/word_pairs.csv")

In [167]:
word_pairs_text

DataFrame[value: string]

In [169]:
word_pairs_text.rdd.take(5)

[Row(value='"Author:",1'),
 Row(value='"[eBook",1'),
 Row(value='"#4300]",1'),
 Row(value='"Produced",1'),
 Row(value='"Col",1')]

In [172]:
word_pairs_text.toPandas()

                       value
0                "Author:",1
1                 "[eBook",1
2                 "#4300]",1
3               "Produced",1
4                    "Col",1
...                      ...
49908            "refund.",2
49909         "WARRANTIES",2
49910         "disclaimer",2
49911  "www.gutenberg.org",2


In [173]:
word_pairs_text.show()

+--------------+
|         value|
+--------------+
|   "Author:",1|
|    "[eBook",1|
|    "#4300]",1|
|  "Produced",1|
|       "Col",1|
|     "Choat",1|
|    "Widger",1|
|     "START",1|
|"stairhead,",1|
|  "crossed.",1|
|"ungirdled,",1|
|    "Kinch!",1|
|  "Solemnly",1|
|  "catching",1|
|"displeased",1|
|   "sleepy,",1|
|    "coldly",1|
|   "length,",1|
|"untonsured",1|
|      "hued",1|
+--------------+


In [174]:
word_pairs_csv = spark.read.csv("data/word_pairs.csv")

In [176]:
word_pairs_csv.show(10)

+----------+---+
|       _c0|_c1|
+----------+---+
|   Author:|  1|
|    [eBook|  1|
|    #4300]|  1|
|  Produced|  1|
|       Col|  1|
|     Choat|  1|
|    Widger|  1|
|     START|  1|
|stairhead,|  1|
|  crossed.|  1|
+----------+---+


In [177]:
word_pairs_csv.toPandas()

                     _c0  _c1
0                Author:    1
1                 [eBook    1
2                 #4300]    1
3               Produced    1
4                    Col    1
...                  ...  ...
49908            refund.    2
49909         WARRANTIES    2
49910         disclaimer    2
49911  www.gutenberg.org    2
