#Part 2

Partially based on Google's tutorial: https://cloud.google.com/dataproc/docs/tutorials/gcs-connector-spark-tutorial#python (see it for additional links and for documentation on gcloud command-line parameters and usage).

##Loading data

In [None]:
from pyspark.sql import SparkSession  # SparkSession is what the builder below needs

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName('Twitter Analysis')
    .config("spark.executor.memory", "1g")
    .config("spark.ui.port", "4050")
    .getOrCreate()
)
sc = spark.sparkContext

from google.colab import drive
drive.mount('/content/drive')

# Pre-processed file containing all of our Twitter graph edges.
raw_edges = sc.textFile('/content/drive/My Drive/twitter_analysis/edges_rdd.txt')

##Step 3.1:
Copy your working PageRank code into the cell below.

In [None]:
%%writefile twitter_analysis_pagerank.py
import pyspark, time
import sys
from operator import add

if len(sys.argv) < 2:
  raise Exception("Input URI required")

def q2a(filename):
    """Read 'src dst' edge lines and group destinations by source (ids shifted to 0-based)."""
    file_rdd = sc.textFile(filename).map(lambda x: x.split()).map(lambda x: (int(x[0])-1, int(x[1])-1))
    graph_rdd = file_rdd.groupByKey().map(lambda x: (x[0], list(x[1])))
    return graph_rdd

def q2b(graph_rdd):
    """Turn each adjacency list into a transition-matrix column: {destination: 1/out_degree}."""
    def get_length(destinations):
        destinations = set(destinations)  # ignore duplicate edges
        output = {}
        for item in destinations:
            output[item] = 1/len(destinations)
        return output
    graph_rdd = graph_rdd.map(lambda x: (x[0], get_length(x[1])))
    return graph_rdd


def q2c(transition_matrix_col_rdd):
    """Transpose column storage into rows: (row, iterable of (column, value))."""
    row_rdd = transition_matrix_col_rdd.flatMap(lambda column: ((row, (column[0], column[1][row])) for row in column[1])).groupByKey().sortByKey()
    return row_rdd

def row_multiply(row, R):
    """Dot product of one sparse matrix row with the rank vector R."""
    result = 0
    for column, value in row:
        # Raises KeyError if a column index has no entry in R, so every
        # source node is assumed to also appear as a destination.
        result += value * R[column]
    return result

def q2e(filename):
    """Run 100 power-iteration steps and return the five highest-ranked nodes."""
    graph_rows = q2c(q2b(q2a(filename)))
    N = graph_rows.count()
    # R is technically not a vector; it is a dictionary of index to value
    R = dict(enumerate([1/N]*N))
    for t in range(100):
        #print(sorted(R.items()))
        # Ship the current ranks to the executors, then compute R again
        vecR = sc.broadcast(R)
        row_results = graph_rows.map(lambda kv: (kv[0], row_multiply(kv[1],vecR.value)))
        R = row_results.collectAsMap()

    print("R:",sorted(R.items()))
    return row_results.sortBy(lambda kv: -kv[1]).take(5)


sc = pyspark.SparkContext()

time_start = time.time()

page_rank = q2e(sys.argv[1])  # q2e is the PageRank entry point defined above
print(page_rank)

time_end = time.time()
print(f"elapsed time is {time_end-time_start}")

Overwriting pyspark_pagerank.py
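
The script above stores the transition matrix as sparse columns, transposes it into rows, and then repeatedly multiplies it by the rank vector. The following is a minimal pure-Python sketch of that same idea, not the Spark implementation itself. The function names, the explicit `nodes` list and the three-node graph are hypothetical and only there for illustration; like the script, the sketch uses no damping factor.

```python
from collections import defaultdict

def build_columns(edges):
    """Map each source node to {destination: 1/out_degree} (one matrix column)."""
    out_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)  # a set ignores duplicate edges
    return {src: {dst: 1 / len(dsts) for dst in dsts}
            for src, dsts in out_links.items()}

def columns_to_rows(columns):
    """Transpose column storage into {row: [(column, value), ...]}."""
    rows = defaultdict(list)
    for col, entries in columns.items():
        for row, value in entries.items():
            rows[row].append((col, value))
    return dict(rows)

def power_iterate(rows, nodes, iterations=100):
    """Start from a uniform rank vector and repeatedly apply the matrix."""
    rank = {node: 1 / len(nodes) for node in nodes}
    for _ in range(iterations):
        rank = {node: sum(value * rank[col] for col, value in rows.get(node, []))
                for node in nodes}
    return rank

# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
toy_nodes = sorted({node for edge in toy_edges for node in edge})
toy_rank = power_iterate(columns_to_rows(build_columns(toy_edges)), toy_nodes)
print({node: round(value, 3) for node, value in toy_rank.items()})
```

On this toy graph every node has at least one outgoing edge, so the ranks keep summing to 1 and settle at the solution of r = Mr, which is 0.4, 0.2 and 0.4 for nodes 0, 1 and 2.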


##Step 3.2:
Edit the cell to add your USERNAME

In [1]:
USERNAME="username"
%env REGION=australia-southeast1
%env ZONE=australia-southeast1-a
%env PROJECT=data301-2023-$USERNAME
%env CLUSTER=data301-2023-$USERNAME-lab4-cluster
%env BUCKET=data301-2023-$USERNAME-lab4-bucket


env: REGION=australia-southeast1
env: ZONE=australia-southeast1-a
env: PROJECT=data301-2023-username
env: CLUSTER=data301-2023-username-lab4-cluster
env: BUCKET=data301-2023-username-lab4-bucket
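
The `%env` lines above derive every resource name from `USERNAME`. As a rough sketch of that naming scheme in plain Python, the helper below rebuilds the same three names and runs a simplified sanity check on the bucket name. Both function names are hypothetical, and the regular expression is an assumption that covers only part of the Cloud Storage naming rules (length, lowercase letters, digits and dashes).

```python
import re

def lab_resource_names(username, course="data301", year=2023, lab="lab4"):
    """Build the project, cluster and bucket names in the pattern used above."""
    project = f"{course}-{year}-{username}"
    return {
        "PROJECT": project,
        "CLUSTER": f"{project}-{lab}-cluster",
        "BUCKET": f"{project}-{lab}-bucket",
    }

def looks_like_valid_bucket_name(name):
    # Simplified check (an assumption, not the full rule set): 3-63 characters,
    # lowercase letters, digits and dashes, alphanumeric at both ends.
    return re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name) is not None

names = lab_resource_names("username")
for key, value in names.items():
    print(f"{key}={value}")
print("bucket name ok:", looks_like_valid_bucket_name(names["BUCKET"]))
```

Checking the name locally catches problems such as uppercase letters in the username before any `gcloud` or `gsutil` command is run.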


##Step 3.3: 
Run the code below to set up the Google Cloud project and storage bucket.

In [None]:
!python3 -m pip install google-cloud-dataproc[libcst]


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting google-cloud-dataproc[libcst]
  Downloading google_cloud_dataproc-5.4.1-py2.py3-none-any.whl (307 kB)
Collecting grpc-google-iam-v1<1.0.0dev,>=0.12.4
  Downloading grpc_google_iam_v1-0.12.6-py2.py3-none-any.whl (26 kB)
Installing collected packages: grpc-google-iam-v1, google-cloud-dataproc
Successfully installed google-cloud-dataproc-5.4.1 grpc-google-iam-v1-0.12.6


In [None]:
!gcloud auth login

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=32555940559.apps.googleusercontent.com&redirect_uri=https%3A%2F%2Fsdk.cloud.google.com%2Fauthcode.html&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=d2aBQ2Db7E8hRJKhKx2nny6RtxOehr&prompt=consent&access_type=offline&code_challenge=vODjiUJBZeBMKpeRsLVy2WLt2y5GrGDUoFKvBBioy6I&code_challenge_method=S256

Enter authorization code: 4/0AbUR2VOnsIt8q3fLKlAWEZuePz30nESfZ4fAG0SItWjlZOWYfX21C8XEjuisXu1WnjzK_w

You are now logged in as [lukafoy2@gmail.com].
Your current project is [None].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


In [None]:
!gcloud config set project $PROJECT


Updated property [core/project].


In [None]:
!gcloud services enable dataproc.googleapis.com cloudresourcemanager.googleapis.com


Operation "operations/acat.p2-652813555918-fe49351f-97f8-480c-b321-f7f07fc0d26f" finished successfully.


In [None]:
!gsutil mb -c regional -l $REGION -p $PROJECT gs://$BUCKET

Creating gs://data301-2023-lfoy-lab4-bucket/...
ServiceException: 409 A Cloud Storage bucket named 'data301-2023-lfoy-lab4-bucket' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


Run and modify the cluster create/execute/delete code for each test.

**NOTE**: it may take 5-10 minutes

In [None]:
!gcloud storage cp ./facebook-large.txt gs://$BUCKET

Copying file://./facebook-large.txt to gs://data301-2023-lfoy-lab4-bucket/facebook-large.txt


In [None]:
!gcloud dataproc clusters create $CLUSTER --region=$REGION --bucket=$BUCKET --zone=$ZONE \
--master-machine-type=n1-standard-2 --worker-machine-type=n1-standard-2 \
--image-version=1.5 --max-age=30m --num-masters=1 --num-workers=11

Waiting on operation [projects/data301-2023-lfoy/regions/australia-southeast1/operations/cb5b170d-d9a0-36ac-abe7-1e860fbb85ea].

Created [https://dataproc.googleapis.com/v1/projects/data301-2023-lfoy/regions/australia-southeast1/clusters/data301-2023-lfoy-lab4-cluster] Cluster placed in zone [australia-southeast1-a].


In [None]:
!gcloud dataproc jobs submit pyspark --cluster=$CLUSTER --region=$REGION twitter_analysis_pagerank.py -- gs://$BUCKET/facebook-large.txt

Job [a8167ed5c6fe42c89a7a891bb0ce9f6e] submitted.
Waiting for job output...
23/05/08 22:35:23 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
23/05/08 22:35:23 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
23/05/08 22:35:23 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator
23/05/08 22:35:23 INFO org.spark_project.jetty.util.log: Logging initialized @4121ms to org.spark_project.jetty.util.log.Slf4jLog
23/05/08 22:35:23 INFO org.spark_project.jetty.server.Server: jetty-9.4.z-SNAPSHOT; built: unknown; git: unknown; jvm 1.8.0_362-b09
23/05/08 22:35:23 INFO org.spark_project.jetty.server.Server: Started @4263ms
23/05/08 22:35:23 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@3f1f2527{HTTP/1.1, (http/1.1)}{0.0.0.0:43131}
23/05/08 22:35:25 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at data301-2023-lfoy-lab4-cluster-m/10.152.15.232:8032
23/05/08 22:35:25 INFO org.apache.hadoop.yarn.cl

In [None]:
!gcloud dataproc clusters delete $CLUSTER --region=$REGION --quiet

Waiting on operation [projects/data301-2023-lfoy/regions/australia-southeast1/operations/92aa967b-cae9-3112-a13e-891046fbfa86].
Deleted [https://dataproc.googleapis.com/v1/projects/data301-2023-lfoy/regions/australia-southeast1/clusters/data301-2023-lfoy-lab4-cluster].
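
Because the create/submit/delete cycle is repeated for each test, it may help to generate the three commands from a single worker count. The sketch below only assembles the argument lists and does not run anything. It reuses the flags from the cells above; the function name `dataproc_commands`, its parameters and the `example_env` dictionary are hypothetical.

```python
import os

def dataproc_commands(num_workers, script="twitter_analysis_pagerank.py",
                      data_file="facebook-large.txt", env=None):
    """Return the create, submit and delete commands as argument lists."""
    env = os.environ if env is None else env
    cluster, region = env["CLUSTER"], env["REGION"]
    zone, bucket = env["ZONE"], env["BUCKET"]
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              f"--region={region}", f"--bucket={bucket}", f"--zone={zone}",
              "--master-machine-type=n1-standard-2",
              "--worker-machine-type=n1-standard-2",
              "--image-version=1.5", "--max-age=30m", "--num-masters=1",
              f"--num-workers={num_workers}"]
    submit = ["gcloud", "dataproc", "jobs", "submit", "pyspark",
              f"--cluster={cluster}", f"--region={region}", script,
              "--", f"gs://{bucket}/{data_file}"]
    delete = ["gcloud", "dataproc", "clusters", "delete", cluster,
              f"--region={region}", "--quiet"]
    return [create, submit, delete]

# Placeholder values matching the Step 3.2 output for USERNAME="username"
example_env = {
    "REGION": "australia-southeast1",
    "ZONE": "australia-southeast1-a",
    "CLUSTER": "data301-2023-username-lab4-cluster",
    "BUCKET": "data301-2023-username-lab4-bucket",
}
for command in dataproc_commands(2, env=example_env):
    print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` in a notebook where `gcloud` is installed and authenticated; keeping the delete command in the same list as the create command makes it harder to forget to tear the cluster down after a test.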
