The objective of this analysis is to examine the relationship between changes in the number of Bitcoin-related Reddit comments and the price of Bitcoin.
The analysis runs on a Spark cluster, and Docker is used to make contextualizing the nodes easy.
Before contextualizing the Spark (and Hadoop) containers, make sure every machine can ssh to every other machine by configuring `/etc/hosts`, `~/.ssh/config`, and `~/.ssh/authorized_keys`. Docker must also be installed.
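
The prerequisites above can be checked before starting anything. The following is a minimal sketch, not part of this repository: the function names `have_cmd`, `can_ssh`, and `preflight` are made up for illustration, and the hostnames in the example call are placeholders for whatever is in your `/etc/hosts`.

```shell
#!/usr/bin/env bash
# Sketch: pre-flight checks before starting the containers.
# All function names here are hypothetical helpers, not repository scripts.

# Return 0 if the given command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Return 0 if key-based ssh to the host works.
# BatchMode=yes makes ssh fail instead of prompting for a password.
can_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Check for Docker and for ssh access to every host given as an argument.
# Prints one line per check and returns non-zero if any check failed.
preflight() {
    local failed=0
    local host
    if have_cmd docker; then
        echo "ok: docker found"
    else
        echo "missing: docker"
        failed=1
    fi
    for host in "$@"; do
        if can_ssh "$host"; then
            echo "ok: ssh $host"
        else
            echo "fail: ssh $host"
            failed=1
        fi
    done
    return "$failed"
}

# Example call (placeholder hostnames):
# preflight spark-master spark-worker-1 spark-worker-2
```

Running `preflight` from each machine in turn covers the "every machine can ssh to every other machine" requirement, since a single run only tests connections that originate from the current host.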
- Configure Spark and Hadoop by editing the files inside the `config` folder.
  - Specify the hostname or IP of the Spark master in the `spark-env.sh` script.
  - Put the hostnames or IPs of the workers in the `workers` file.
- Clone this repository, then `cd` into it and switch to superuser mode.

      $ git clone https://github.com/mjorrico/pyspark-bitcoin-reddit.git
      $ cd pyspark-bitcoin-reddit
      $ sudo bash
- Build the image.

      $ ./build.sh
- Run the master on one VM. Use the optional `-f` flag to format the namenode.

      $ ./start-master.sh [-f]
- Run the workers on multiple VMs. The `-h` flag must be provided; its value is the worker's hostname, which is also present in the `/etc/hosts` and `workers` files.

      $ ./start-worker.sh -h [workers_hostname]
- The cluster should now be up and running. The Spark UI and HDFS GUI should be running on the master on ports `8080` and `9870`, respectively.
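
The worker and verification steps above can be sketched as a few shell helpers. This is an illustrative sketch under assumptions, not part of the repository: the helper names (`list_workers`, `worker_commands`, `ui_urls`, `ui_is_up`) are hypothetical, the `workers` file is assumed to hold one hostname per line, and `worker_commands` only prints the command for each VM rather than running anything.

```shell
#!/usr/bin/env bash
# Sketch: helpers for the multi-VM steps. Helper names are hypothetical.

# Print the hostnames in a workers file, skipping blank lines and
# lines that start with '#'. Assumes one hostname per line.
list_workers() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Print the command to run on each worker VM (nothing is executed).
worker_commands() {
    local host
    list_workers "$1" | while read -r host; do
        echo "./start-worker.sh -h $host"
    done
}

# Print the web UI URLs for the given master hostname or IP.
ui_urls() {
    echo "Spark UI: http://$1:8080"
    echo "HDFS UI:  http://$1:9870"
}

# Return 0 if the URL answers an HTTP request within 5 seconds.
ui_is_up() {
    curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# Example calls (placeholder path and hostname):
# worker_commands config/workers
# ui_urls spark-master
# ui_is_up http://spark-master:8080 && echo "Spark UI reachable"
```

Printing the worker commands instead of running them over ssh keeps the sketch safe to try: each printed line still has to be run on the matching VM, from inside the cloned repository, after the image has been built there.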