### Exploring Benford's Law with Spark RDDs

#### Name: `Jeff Scanlon`
#### AndrewID: `jscanlo2`

In [None]:
#!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#!wget -q https://apache.claz.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
#!tar xf spark-3.0.1-bin-hadoop2.7.tgz
#!pip install -q findspark

In [None]:
#import os
#os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
#os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"

In [None]:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()


In [None]:
#uploaded = files.upload()

In [None]:
lines = sc.textFile('tallest-buildings.txt')

In [None]:
lines.take(5)

['|Mixed use! [[Burj Khalifa]] ! {{flag|United Arab Emirates}} ! [[Dubai]] ! 829.8 ! 2722 ! 2010 ! {{coord|25|11|50.0|N|55|16|26.6|E|type:landmark|name=Burj Dubai}}',
 '|Self-supporting tower ! [[Tokyo Skytree]] ! {{flag|Japan}} ! [[Tokyo]] ! 634 ! 2080 ! 2011 ! {{Coord|35|42|36.5|N|139|48|39|E|type:landmark|name=Tokyo Skytree}}',
 '|Clock building! [[Abraj Al Bait Towers]] ! {{flag|Saudi Arabia}} ! [[Mecca]] ! 601 ! 1972 ! 2011! {{coord|21|25|08|N|39|49|35|E|type:landmark|name=Abraj Al Bait Towers}}',
 '|Military structure ! Large masts of [[INS Kattabomman]] ! {{flag|India}} ! [[Tirunelveli]] ! 471 ! 1545 ! 2014 ! {{coord|8|22|42.52|N|77|44|38.45|E|type:landmark|name=INS Kattabomman Large Mast West}} ; {{coord|8|22|30.13|N|77|45|21.07|E|type:landmark|name=INS Kattabomman Large Mast East}}',
 '|Mast radiator ! [[Lualualei VLF transmitter]] ! {{flag|United States}} ! [[Lualualei Hawaii]] ! 458 ! 1503 ! 1972 ! {{coord|21|25|11.87|N|158|08|53.67|W|type:landmark|name=VLF transmitter Lualu

### The Exercise
Note that the fields are separated by the `!` character. Field 5 (counting from 0) is the height of each structure in feet. For the first five records these fields are `[' 2722 ', ' 2080 ', ' 1972 ', ' 1545 ', ' 1503 ']`. Determine how often each leading digit occurs among the heights (in feet). Expected final answer:

```
[('1', 17),
 ('2', 6),
 ('3', 4),
 ('4', 7),
 ('5', 13),
 ('6', 5),
 ('7', 2),
 ('9', 1)]
```
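
As a plain-Python illustration of the extraction step (no Spark needed), the sketch below takes the five sample height fields quoted above, strips the padding, and tallies the first character. The helper name `leading_digit` is introduced here for illustration only and is not part of the notebook.

```python
from collections import Counter

# The first five height fields quoted in the exercise text (note the padding spaces).
sample_heights = [' 2722 ', ' 2080 ', ' 1972 ', ' 1545 ', ' 1503 ']

def leading_digit(field):
    """Return the first non-blank character of a field (hypothetical helper)."""
    return field.strip()[:1]

# Tally how often each leading digit occurs in the sample.
digit_counts = Counter(leading_digit(h) for h in sample_heights)
print(sorted(digit_counts.items()))  # [('1', 3), ('2', 2)]
```

Stripping before slicing avoids depending on the exact amount of padding around each number.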

### Answer

In [None]:
# As multiple expressions
# Split each record into its '!'-separated fields.
r2 = lines.map(lambda line: line.strip().split('!'))
r2.take(5)

In [None]:
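# Field 5 is the height in feet. The character at index 1 is its leading digit,
# assuming exactly one space of padding before the number.
r3 = r2.map(lambda line: line[5][1:2])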
r3.take(5)

In [None]:
# Pair each leading digit with a count of 1.
r4 = r3.map(lambda w: (w, 1))
r4.take(5)

In [None]:
# Collect all the 1s that belong to the same digit.
r5 = r4.groupByKey()
r5.take(5)

In [None]:
# Sum each group's 1s to get the count per digit.
r6 = r5.map(lambda x: (x[0], sum(x[1])))
r6.sortByKey().take(9)
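
To make the intermediate RDDs easier to picture, here is a minimal plain-Python sketch of the same pair, group, and sum steps, applied to the leading digits of the five sample heights from the exercise. The `defaultdict` is only a local stand-in for what `groupByKey` does across partitions; it is not how Spark implements it.

```python
from collections import defaultdict

# Leading digits of the five sample heights quoted in the exercise.
digits = ['2', '2', '1', '1', '1']

# Step 1: pair each digit with a count of 1 (mirrors r4).
pairs = [(d, 1) for d in digits]

# Step 2: collect the values for each key (a local stand-in for groupByKey).
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Step 3: sum each group and sort by key (mirrors r6.sortByKey()).
totals = sorted((key, sum(values)) for key, values in grouped.items())
print(totals)  # [('1', 3), ('2', 2)]
```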

In [None]:
# As a single expression
(
    lines.map(lambda line: line.strip().split('!'))
    .map(lambda line: line[5][1:2])
    .map(lambda w: (w, 1))
    .groupByKey()
    .map(lambda x: (x[0], sum(x[1])))
    .sortByKey().take(9)
)
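
Since the notebook's title refers to Benford's Law, a natural follow-up is to compare the observed counts with Benford's predicted proportions, $P(d) = \log_{10}(1 + 1/d)$. The sketch below is one way to do that in plain Python; the `observed` dictionary is copied from the expected answer in the exercise, and the variable names are illustrative.

```python
import math

# Observed leading-digit counts, copied from the expected answer in the exercise.
observed = {'1': 17, '2': 6, '3': 4, '4': 7, '5': 13, '6': 5, '7': 2, '9': 1}
total = sum(observed.values())

# Benford's law: P(d) = log10(1 + 1/d) for d = 1..9.
benford = {str(d): math.log10(1 + 1 / d) for d in range(1, 10)}

# Print the observed share next to the Benford proportion for each digit.
for digit in sorted(benford):
    share = observed.get(digit, 0) / total
    print(f"{digit}: observed {share:.3f}  Benford {benford[digit]:.3f}")
```

With only 55 structures the sample is small, so deviations from the Benford proportions (for example, the relatively large count for digit 5) should be read with caution.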