# Class 8 Notebook 1: Working with a text file

Class 8 (7 Dec 2016) of [BS1804-1617 Fundamentals of Database Technologies](https://imperialbusiness.school/category/bs1804-1617/) by [Piotr Migdal](http://p.migdal.pl/)

We will work on [The Tragedy of Romeo and Juliet by William Shakespeare](http://www.gutenberg.org/ebooks/1112) from [Project Gutenberg](https://www.gutenberg.org/).

References:

* [PySpark list of RDD methods](http://spark.apache.org/docs/latest/api/python/pyspark.html)
* [Spark Programming Guide](http://spark.apache.org/docs/latest/programming-guide.html), especially:
  * [Transformations and Actions](http://spark.apache.org/docs/latest/programming-guide.html#transformations)

First, we need to download this file:

In [1]:
!wget -O /tmp/romeo_and_juliet.txt http://www.gutenberg.org/cache/epub/1112/pg1112.txt

--2016-12-11 19:50:46--  http://www.gutenberg.org/cache/epub/1112/pg1112.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 178983 (175K) [text/plain]
Saving to: ‘/tmp/romeo_and_juliet.txt’


2016-12-11 19:50:52 (276 KB/s) - ‘/tmp/romeo_and_juliet.txt’ saved [178983/178983]



In [2]:
import pyspark
sc = pyspark.SparkContext('local[*]')

In [3]:
# only an instruction for loading; nothing happens at this point
rdd = sc.textFile("file:/tmp/romeo_and_juliet.txt")
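
The comment above notes that `textFile` only records what to load and that work happens later, when an action such as `count()` runs. As a rough plain-Python analogy (not Spark itself), the built-in `map` is lazy in the same spirit. The helper `record_length` and the sample lines are made up for illustration.

```python
# Plain-Python analogy for lazy evaluation; this does not use Spark.
calls = []  # records every line that actually gets processed

def record_length(line):
    """Hypothetical helper: remember that it was called, then return the length."""
    calls.append(line)
    return len(line)

lines = ["Title: Romeo and Juliet", "", "almost no restrictions whatsoever."]

lazy_lengths = map(record_length, lines)  # builds a lazy iterator, runs nothing
calls_before = len(calls)                 # still zero calls at this point

lengths = list(lazy_lengths)              # consuming the iterator triggers the work
calls_after = len(calls)                  # one call per input line
```

In Spark the split is between transformations (lazy, like `map` or `filter`) and actions (eager, like `count` or `take`).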

In [4]:
# the number of lines
rdd.count()

4853

In [5]:
# each element of the RDD is one line of the file
rdd.take(10)

['The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare',
 '',
 'This eBook is for the use of anyone anywhere at no cost and with',
 'almost no restrictions whatsoever.  You may copy it, give it away or',
 're-use it under the terms of the Project Gutenberg License included',
 'with this eBook or online at www.gutenberg.org/license',
 '',
 '',
 'Title: Romeo and Juliet',
 '']

In [6]:
# the 5 longest lines
rdd.top(5, len)

['End of the Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare',
 '    art thou sociable, now art thou Romeo; now art thou what thou art, by',
 "    her into a fool's paradise, as they say, it were a very gross kind of",
 "    an ill thing to be off'red to any gentlewoman, and very weak dealing.",
 '    man that hath a hair more or a hair less in his beard than thou hast.']
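
The second argument of `top` is a key function, so `rdd.top(5, len)` ranks lines by length rather than by string order. A minimal plain-Python sketch of the same idea uses `heapq.nlargest` on a few lines taken from the outputs in this notebook. The function name `longest_lines` is an assumption for illustration.

```python
import heapq

sample_lines = [
    "Title: Romeo and Juliet",
    "",
    "    Prodigious birth of love it is to me",
    "  Jul. Three words, dear Romeo, and good night indeed.",
]

def longest_lines(lines, n):
    """Return the n longest lines, longest first (local analogue of rdd.top(n, len))."""
    return heapq.nlargest(n, lines, key=len)

two_longest = longest_lines(sample_lines, 2)
```

Leading spaces count towards the length, which is why indented verse lines rank high in the Spark result.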

In [7]:
# 5 random lines, sampled without replacement
rdd.takeSample(False, 5)

['  Jul. Three words, dear Romeo, and good night indeed.',
 '    Prodigious birth of love it is to me',
 "    And she steal love's sweet bait from fearful hooks.",
 'such as creation of derivative works, reports, performances and',
 '    Which, well thou knowest, is cross and full of sin.']

## Exercises

What is:

* the number of empty lines?
* the number of words? (hint: `flatMap`, `x.split()`)
* the number of UPPERCASE words?  (hint: `x.isupper()`)
* the 10 most popular words?  (hint: `reduceByKey`, `top`)
* the count of each letter, sorted alphabetically? (hint: `sortByKey`)
* the number of lines for each indentation width? (hint: `len(x) - len(x.lstrip())`)

Additional: [Converting a String to a List of Words? - Stack Overflow](http://stackoverflow.com/questions/6181763/converting-a-string-to-a-list-of-words)
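
Note that `x.split()` splits on whitespace only, so punctuation stays attached (`'Romeo,'` and `'Romeo'` are different words) and case is preserved. A minimal sketch of a stricter tokenizer, in the spirit of the Stack Overflow link above, could look like this. The names `WORD_RE` and `tokenize` are assumptions, and the regular expression is only one of many reasonable choices.

```python
import re

# Runs of ASCII letters and apostrophes; an assumption that suits this text,
# not a general-purpose definition of a word.
WORD_RE = re.compile(r"[A-Za-z']+")

def tokenize(line):
    """Return the lowercase words of a line, without surrounding punctuation."""
    candidates = (match.strip("'") for match in WORD_RE.findall(line))
    return [word.lower() for word in candidates if word]

example = tokenize("  Jul. Three words, dear Romeo, and good night indeed.")
```

Such a function can be passed to `flatMap` in place of `lambda x: x.split()`, for example `rdd.flatMap(tokenize)`. The counts would then differ from the ones shown in this notebook.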

In [8]:
# lines that are exactly the empty string (whitespace-only lines are not counted)
rdd.filter(lambda x: x == "").count()

1234

In [9]:
# split every line on whitespace and count the resulting tokens
rdd.flatMap(lambda x: x.split()).count()

28983

In [10]:
rdd \
  .flatMap(lambda x: x.split()) \
  .filter(lambda x: x.isupper()) \
  .count()

1151
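
The count above depends on how `str.isupper()` treats tokens: it returns `True` when a string has at least one cased character and all cased characters are uppercase, so `'I'`, `'O,'` and `'ROMEO.'` all qualify. The sketch below shows this on a few made-up tokens, together with one possible stricter definition. The name `is_uppercase_word` and the "at least two letters" rule are assumptions, not part of the exercise.

```python
import string

tokens = ["I", "ROMEO.", "Jul.", "O,", "1112", "the"]

# What the filter in the cell above would keep.
kept_by_isupper = [t for t in tokens if t.isupper()]

def is_uppercase_word(token):
    """Stricter variant: strip punctuation, require 2+ letters, all uppercase."""
    core = token.strip(string.punctuation)
    return len(core) > 1 and core.isalpha() and core.isupper()

kept_by_strict_rule = [t for t in tokens if is_uppercase_word(t)]
```

Which definition is appropriate depends on the question being asked, for example whether the pronoun "I" should count as an uppercase word.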

In [11]:
# cache the words, since this RDD is reused in the next cells
words = rdd.flatMap(lambda x: x.split()).cache()

In [12]:
from operator import add

In [14]:
# classic word count: pair each word with 1, sum per word, take the 10 largest counts
words \
  .map(lambda x: (x, 1)) \
  .reduceByKey(add) \
  .top(10, lambda kv: kv[1])

[('the', 762),
 ('I', 549),
 ('and', 539),
 ('to', 522),
 ('of', 485),
 ('a', 453),
 ('in', 330),
 ('is', 322),
 ('my', 310),
 ('with', 274)]

In [15]:
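# split each word into characters and count them;
# the result is sorted by count (ascending), not alphabetically
words \
  .flatMap(list) \
  .map(lambda x: (x, 1)) \
  .reduceByKey(add) \
  .sortBy(lambda kv: kv[1]) \
  .collect()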

[('%', 1),
 ('#', 2),
 ('$', 2),
 ('>', 2),
 ('<', 2),
 ('Z', 2),
 ('@', 2),
 ('X', 3),
 ('Q', 4),
 ('7', 7),
 ('6', 7),
 ('4', 9),
 ('8', 9),
 ('5', 15),
 ('9', 16),
 ('3', 16),
 ('K', 19),
 ('0', 19),
 ('"', 24),
 ('2', 25),
 ('/', 28),
 ('*', 32),
 ('z', 33),
 (':', 36),
 ('(', 38),
 (')', 38),
 ('V', 53),
 ('q', 72),
 ('1', 94),
 ('U', 94),
 (']', 94),
 ('[', 97),
 ('Y', 110),
 ('D', 156),
 ('x', 157),
 ('j', 160),
 ('L', 215),
 ('J', 215),
 ('H', 246),
 ('G', 268),
 ('F', 283),
 ('P', 311),
 ('N', 320),
 ('C', 320),
 ('-', 322),
 ('B', 335),
 ('?', 371),
 ('O', 374),
 ('M', 380),
 (';', 393),
 ('E', 393),
 ('S', 400),
 ('W', 455),
 ('R', 484),
 ('!', 491),
 ('A', 621),
 ('T', 806),
 ('k', 968),
 ('I', 984),
 ("'", 991),
 ('v', 1116),
 ('p', 1673),
 ('b', 1687),
 ('g', 1992),
 ('f', 2194),
 ('w', 2401),
 ('c', 2431),
 (',', 2628),
 ('.', 2731),
 ('y', 2828),
 ('m', 3432),
 ('u', 3995),
 ('d', 4284),
 ('l', 5070),
 ('i', 6870),
 ('s', 6982),
 ('h', 7159),
 ('n', 7173),
 ('r', 7202),
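
The exercise asks for letters in alphabetical order, whereas the cell above orders by count and also includes digits and punctuation. In Spark, replacing `.sortBy(...)` with `.sortByKey()` would order the pairs by character. The plain-Python sketch below shows the intended result on a tiny sample. The name `letter_counts` and the choice to keep only alphabetic characters are assumptions.

```python
from collections import Counter

def letter_counts(words):
    """Count alphabetic characters and return (letter, count) pairs sorted by letter."""
    counts = Counter(ch for word in words for ch in word if ch.isalpha())
    return sorted(counts.items())

sample_counts = letter_counts(["Romeo", "and", "Juliet"])
```

Sorting by key puts uppercase letters before lowercase ones, because that is how Python compares strings. Lowercasing the words first would merge the two cases.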

In [16]:
# number of lines for each indentation width (count of leading whitespace characters)
rdd \
  .map(lambda x: len(x) - len(x.lstrip())) \
  .map(lambda x: (x, 1)) \
  .reduceByKey(add) \
  .collect()

[(0, 1623),
 (2, 872),
 (4, 2188),
 (6, 10),
 (8, 2),
 (10, 1),
 (12, 1),
 (14, 3),
 (16, 6),
 (18, 1),
 (20, 2),
 (22, 5),
 (24, 9),
 (26, 4),
 (28, 2),
 (30, 2),
 (36, 1),
 (38, 3),
 (40, 1),
 (42, 1),
 (44, 2),
 (48, 1),
 (50, 1),
 (52, 2),
 (58, 1),
 (1, 1),
 (5, 28),
 (7, 1),
 (9, 1),
 (11, 5),
 (13, 2),
 (15, 2),
 (17, 1),
 (19, 7),
 (21, 5),
 (23, 7),
 (25, 3),
 (27, 3),
 (31, 1),
 (33, 1),
 (35, 1),
 (37, 6),
 (39, 1),
 (41, 1),
 (43, 3),
 (45, 2),
 (47, 1),
 (49, 1),
 (51, 5),
 (53, 3),
 (55, 2),
 (57, 15)]
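
The pairs above are not in order because `reduceByKey` makes no ordering guarantee. The even-then-odd pattern is consistent with the keys being hash-partitioned into two partitions, although that is an inference from the output. Adding `.sortByKey()` before `.collect()` would sort by indentation width. A plain-Python sketch of the same histogram on a few sample lines follows. The name `indentation_histogram` is an assumption.

```python
from collections import Counter

def indentation_histogram(lines):
    """Return (indent_width, line_count) pairs sorted by indent width."""
    widths = (len(line) - len(line.lstrip()) for line in lines)
    return sorted(Counter(widths).items())

sample_lines = [
    "Title: Romeo and Juliet",
    "  Jul. Three words, dear Romeo, and good night indeed.",
    "    Prodigious birth of love it is to me",
    "    Which, well thou knowest, is cross and full of sin.",
    "",
]

histogram = indentation_histogram(sample_lines)
```

Empty lines land in the width-0 bucket, which is why that bucket in the Spark output is larger than the number of non-indented text lines alone.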