<a href="https://colab.research.google.com/github/prateekchandrajha/mastering-ml-algorithms/blob/main/SparkBigData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark
!wget -q http://188.165.231.140/pub/bigdata-spark/raw/master/war-and-peace.txt

In [6]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

In [7]:
import findspark
findspark.init()
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

In [8]:
sc

In [9]:
# A helper function to compute the list of words in a line of text
import re
def get_words(line):
    return re.compile('\w+').findall(line)

print(get_words("This, is a test!"))

['This', 'is', 'a', 'test']


### Learning activity: Create RDD with `parallelize`
Transform the list `words` into an rdd. The count should return `3`

In [10]:
words = ["Hello", "Spark", "1"]

In [11]:
sc.parallelize(words).count()

3

### Learning activity: Create RDDs

To analyse large datasets using Spark you will load them into Resilient Distributed Datasets (RDDs). There are a number of ways in which you can create RDDs. Use the `parallelize()` function to create one from a Python collection, and use the `textFile()` function to create an RDD from the file `data/war-and-peace.txt`. 

In [12]:
data = sc.textFile("data/war-and-peace.txt")

In [25]:
n = 100
# data.collect(100).foreach(println)
# data.collect().foreach(lambda line: System.out.println(line))
# import spark.implicits._
# data.take(10)

Py4JJavaError: ignored

### Learning activity: Basic RDD manipulation

Print the number of lines in War and Peace using the method `count()`

In [26]:
data.count()

Py4JJavaError: ignored

Print the first 15 lines using the method `take()`.

In [27]:
data.take(15)

Py4JJavaError: ignored

### Learning activity: `filter()` and `map()` and `distinct()`

Use `filter()` to count the number of lines which mention `war` and the number of lines which mention `peace`.

In [28]:
# How often is war mentioned?
data.filter(lambda line: "war" in get_words(line)).count()

Py4JJavaError: ignored

In [None]:
# How often is peace mentioned?

Use `map()` to capitalise each line in the RDD, and print the first 15 capitalized lines.

In [None]:
# Capitalize each line in the RDD

Use `flatMap()` to create an RDD of the words in War and Peace and count the number of words.

In [None]:
# Split each line into words using get_words()

Finally, use `distinct()` to count the number of different words in the RDD.

In [None]:
# Count the number of distinct words

### Learning activity: Set like transformations

Use the function `union()` to create an RDD of lines with either war or peace mentioned. Count how many lines.

Use the function `intersection()` to create an RDD of lines with both war and peace being mentioned. Count how many lines.

Find all the lines that mention both war and peace without using `intersection()`

### Learning activity: `reduce()`

You have already seen three actions: `collect()` which returns all elements in the RDD, `take(n)`, which return the first `n` elements of the RDD, and `count()` which returns the number of elements in the RDD.

The action `reduce()` takes as input a function which collapses two elements into one. Use it to find the longest word in War and Peace.

The Python function `str.istitle()` returns `True` if the string `str` is titlecased: the first character is uppercase and others are lowercase. Use it to:
* Find the set of distinct words in War and Peace which are titlecased
* Find the set of distinct words in War and Peace which are not titlecased

The Python function `str.lower()` returns a string with all characters of str lowercase. Use it, along with your previously generated RDD to find the set of words in War and Peace which only appear titlecased.

### Learning activity: WordCount in Spark

Use the functions `flatMap()` and `reduceByKey()` to count the number of occurences of each word in War and Peace, and print the count of five words.

### Learning activity: using `groupByKey()`
Reimplement the above word count using `groupByKey()` instead of `reduceByKey()`