### Part 2
In this part, you will implement a simple Spark application. We have provided some sample data collected at this link. Download the file (e.g., using wget) to your home directory on vm1.

In [1]:
import findspark as fs
fs.init('/home/ubuntu/spark-3.3.1-bin-hadoop3')
fs.find()

'/home/ubuntu/spark-3.3.1-bin-hadoop3'

In [2]:
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("DS5110")
            .master("spark://172.31.75.157:7077")
            .config("spark.executor.memory", "1024M")
            .getOrCreate())

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/02/26 11:24:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
df = spark.read.csv("hdfs://172.31.75.157:9000/export.csv", inferSchema="true", header="true")

AnalysisException: Path does not exist: hdfs://172.31.75.157:9000/export.csv

24/02/26 11:26:42 ERROR TaskSchedulerImpl: Lost executor 0 on 172.31.75.157: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
24/02/26 11:26:43 WARN StandaloneAppClient$ClientEndpoint: Connection to 172.31.75.157:7077 failed; waiting for master to reconnect...
24/02/26 11:26:43 WARN StandaloneSchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
24/02/26 11:26:43 WARN StandaloneAppClient$ClientEndpoint: Connection to 172.31.75.157:7077 failed; waiting for master to reconnect...


In [None]:
df.select("battery_level").show()

You then need to sort the data first by country code alphabetically (the third column, cca2) and then by timestamp (the last column).

In [None]:
new_df = df.orderBy("cca2", "timestamp")

In [None]:
new_df.show()

In [None]:
new_df.write.format("csv").mode("overwrite").save("hdfs://172.31.75.157:9000/new_df_hdfs")

### Part 3
In this part, you will implement the PageRank algorithm (§2.1) (or the Wikipedia version), which is an algorithm used by search engines like Google to evaluate the quality of links to a webpage. The algorithm can be summarized as follows:

1. Set initial rank of each page to be 1.
2. On each iteration, each page p contributes to its outgoing neighbors a value of rank(p)/(# of outgoing neighbors of p).
3. Update each page’s rank to be 0.15 + 0.85 * (sum of contributions).
4. Go to next iteration.
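
The four steps above can be sketched in plain Python on a tiny graph before moving to Spark. This is only an illustration under assumptions: the function name `pagerank` and the four-edge toy graph are made up here, and the sketch ranks every page it sees (including pages with no incoming links), which may differ from how an RDD-based version treats such pages.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def pagerank(edges: Iterable[Tuple[str, str]], iterations: int = 3) -> Dict[str, float]:
    """Plain-Python sketch of the four steps above (not a Spark implementation)."""
    links: Dict[str, Set[str]] = defaultdict(set)
    pages: Set[str] = set()
    for src, dst in edges:
        links[src].add(dst)        # duplicate edges collapse in the set
        pages.update((src, dst))

    # Step 1: every page starts with rank 1.
    ranks = {page: 1.0 for page in pages}

    for _ in range(iterations):    # Step 4: repeat for each iteration
        contribs = {page: 0.0 for page in pages}
        # Step 2: each page splits its rank evenly among its outgoing neighbors.
        for src, neighbors in links.items():
            share = ranks[src] / len(neighbors)
            for dst in neighbors:
                contribs[dst] += share
        # Step 3: new rank = 0.15 + 0.85 * (sum of contributions).
        ranks = {page: 0.15 + 0.85 * total for page, total in contribs.items()}
    return ranks


# Toy graph (made up for illustration): a -> b, a -> c, b -> c, c -> a
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
for page, rank in sorted(pagerank(toy_edges, iterations=1).items()):
    print(page, round(rank, 3))
```

After one iteration on this toy graph, the ranks are approximately a = 1.0, b = 0.575, and c = 1.425: page c receives half of a's rank plus all of b's rank.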

**Task 1**
Write a PySpark application that implements the PageRank algorithm. Your PageRank application should output the following two results: 1) print the 50 rows with the highest ranks; 2) save the computed results, as a Spark DataFrame, to HDFS as a CSV file.

In [None]:
import re
from pyspark.sql import SparkSession
from typing import Iterable, Tuple
from pyspark.resultiterable import ResultIterable

# from skeleton code:
# Helper function to calculate URL contributions to the rank of other URLs
def calculateRankContrib(urls: ResultIterable[str], rank: float) -> Iterable[Tuple[str, float]]:
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)

# from skeleton code:
# Helper function to parse a urls string into a tuple of URLs
def parseNeighborURLs(urls: str) -> Tuple[str, str]:
    parts = re.split(r'\s+', urls)
    return parts[0], parts[1]

In [None]:
linesRDD = spark.sparkContext.textFile("hdfs://172.31.75.157:9000/web-BerkStan.txt")

In [None]:
linksRDD = linesRDD.map(lambda urls: parseNeighborURLs(urls)).distinct().groupByKey()
ranksRDD = linksRDD.map(lambda url_neighbors: (url_neighbors[0], 1.0))

In [None]:
for i in range(3):
    # Join the links RDD with the ranks RDD and calculate contributions
    contributions = linksRDD.join(ranksRDD).flatMap(lambda x: calculateRankContrib(x[1][0], x[1][1]))
    # Update ranks based on contributions
    ranksRDD = contributions.reduceByKey(lambda x, y: x + y).mapValues(lambda x: 0.15 + 0.85 * x)

In [None]:
ranksDF = ranksRDD.toDF()

In [None]:
ranksDF.write.format("csv").save("hdfs://172.31.75.157:9000/ranksDF_pt1")

In [None]:
# top 50 URLs by rank
top_50 = ranksRDD.takeOrdered(50, key=lambda x: -x[1])
# Print the top 50
for url, rank in top_50:
    print(f"URL: {url}, Rank: {rank}")

**Task 2**
In order to achieve high parallelism, Spark will split the data into smaller chunks called partitions, which are distributed across different nodes in the cluster. Partitions can be changed in several ways. For example, any shuffle operation on an RDD (e.g., join()) will result in a change in partitions (customizable via user’s configuration). In addition, one can also decide how to partition data when creating/configuring RDDs (hint: e.g., you can use the function partitionBy()). For this task, add appropriate custom RDD partitioning and see what changes. For the computed result: your PageRank application should print the first 50 rows with the highest ranks.
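
To make the idea of key-based partitioning concrete, here is a small plain-Python sketch of how a hash partitioner assigns key-value pairs to partitions. It is not Spark code: the helpers `partition_index` and `partition_by` are hypothetical names, the data is made up, and CRC32 is used only as a reproducible stand-in for Spark's own hash function, so real partition assignments will differ.

```python
import zlib
from typing import Any, List, Tuple

Pair = Tuple[str, Any]


def partition_index(key: str, num_partitions: int) -> int:
    """Map a key to a partition index. CRC32 is a stand-in, not Spark's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_by(pairs: List[Pair], num_partitions: int) -> List[List[Pair]]:
    """Group key-value pairs into num_partitions buckets by key."""
    partitions: List[List[Pair]] = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_index(key, num_partitions)].append((key, value))
    return partitions


# Toy link and rank data (made up for illustration).
links = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]
ranks = [("a", 1.0), ("b", 1.0), ("c", 1.0)]

links_parts = partition_by(links, 4)
ranks_parts = partition_by(ranks, 4)

# Each key sits at the same partition index in both datasets.
for i in range(4):
    print(i, [k for k, _ in links_parts[i]], [k for k, _ in ranks_parts[i]])
```

Because both datasets use the same function and the same number of partitions, every key lands at the same index on both sides, so a join could be computed partition by partition. This is the property that lets Spark avoid reshuffling data when two RDDs share a partitioner.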

In [None]:
# same as before but use partitionBy
linksRDD = linesRDD.map(lambda urls: parseNeighborURLs(urls))\
.distinct().groupByKey().partitionBy(4)

In [None]:
# Initialize a ranks RDD
# mapValues keeps linksRDD's partitioner; a plain map() would discard it
ranksRDD = linksRDD.mapValues(lambda _: 1.0)

In [None]:
# Calculates and updates URL ranks continuously using PageRank algorithm
for i in range(3):
    # Join the links RDD with the ranks RDD and calculate contributions
    contributions = linksRDD.join(ranksRDD).flatMap(lambda x: calculateRankContrib(x[1][0], x[1][1]))
    # Update ranks based on contributions
    ranksRDD = contributions.reduceByKey(lambda x, y: x + y).mapValues(lambda x: 0.15 + 0.85 * x)

In [None]:
ranksDF = ranksRDD.toDF()

In [None]:
ranksDF.write.format("csv").save("hdfs://172.31.75.157:9000/ranksDF_pt2")

In [None]:
# top 50 URLs by rank
top_50 = ranksRDD.takeOrdered(50, key=lambda x: -x[1])
# Print the top 50
for url, rank in top_50:
    print(f"URL: {url}, Rank: {rank}")

**Task 3**
Kill a Worker process and see the changes. You should trigger the failure on a selected worker VM when the application has reached between 25% and 75% of its lifetime (hint: use the Spark Jobs web interface to track the detailed job execution progress):

1. From a shell, clear the memory cache on vm2 using `sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"`.

2. In your shell, kill the Worker process on vm2: use `jps` to get the process ID (PID) of the Spark Worker on vm2, then use the command `kill -9 <Worker_PID>` to kill the Spark Worker process.

For the computed result: your PageRank application should print the first 50 rows with the highest ranks.
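
As a convenience for the kill step, the Worker PID can be picked out of `jps` output with a few lines of Python. This is a sketch under assumptions: `find_worker_pids` is a hypothetical helper, and the sample text below is made up to match the `<PID> <name>` shape that `jps` prints.

```python
from typing import List


def find_worker_pids(jps_output: str) -> List[int]:
    """Return the PIDs of lines whose process name is exactly 'Worker'."""
    pids: List[int] = []
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "Worker":
            pids.append(int(parts[0]))
    return pids


# Made-up sample text in the "<PID> <name>" shape that jps prints.
sample = "2101 DataNode\n2345 Worker\n2410 Jps\n"
print(find_worker_pids(sample))  # [2345]
```

On vm2, the real text could be obtained with `subprocess.run(["jps"], capture_output=True, text=True).stdout` and passed to the helper before running `kill -9` on the PID it returns.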