Bitcoin Price and Reddit Comments Analysis Using PySpark

The Analysis

The objective of the analysis is to examine the relationship between changes in the number of Bitcoin-related Reddit comments and the price of Bitcoin.
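One common way to quantify such a relationship is the Pearson correlation between day-over-day changes in the two series. The following is a minimal plain-Python sketch of that idea, not the code from the notebooks: whether the notebooks use this measure is an assumption, the helper names `pct_change` and `pearson` are hypothetical, and the numbers are made up purely for illustration.

```python
from math import sqrt


def pct_change(values):
    """Return the relative change between consecutive values."""
    return [(b - a) / a for a, b in zip(values, values[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up toy numbers, for illustration only (not real data).
daily_comments = [100, 150, 120, 180, 90]
daily_price = [10.0, 12.0, 11.0, 13.0, 9.0]

r = pearson(pct_change(daily_comments), pct_change(daily_price))
print(f"correlation of daily changes: {r:.3f}")
```

On the cluster, the same computation would be expressed over Spark DataFrames rather than Python lists.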

The Tools

The analysis runs on a Spark cluster. Docker is used to make contextualization of the nodes easy.

Contextualize

Before contextualizing the Spark (and Hadoop) containers, make sure every machine can ssh to every other machine by configuring /etc/hosts, ~/.ssh/config, and ~/.ssh/authorized_keys. Docker must also be installed.
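The SSH prerequisite can be verified before going further. The sketch below is an assumption about how one might do that, not part of the repository: it runs `ssh` in batch mode (so it fails instead of prompting for a password) against each host and reports the ones that could not be reached. The function names and the example hostnames are hypothetical.

```python
import subprocess


def ssh_check_command(host, timeout_s=5):
    """Build an ssh command that runs `true` on the host without prompting."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                  # fail rather than ask for a password
        "-o", f"ConnectTimeout={timeout_s}",    # give up quickly on unreachable hosts
        host,
        "true",
    ]


def unreachable_hosts(hosts, run=subprocess.run):
    """Return the hosts for which the ssh check exits with a non-zero status."""
    failed = []
    for host in hosts:
        result = run(ssh_check_command(host), capture_output=True)
        if result.returncode != 0:
            failed.append(host)
    return failed
```

Calling `unreachable_hosts(["spark-master", "spark-worker-1"])` on each machine (with the real hostnames from /etc/hosts substituted for these placeholders) should return an empty list once the keys and host entries are in place.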

Configure Spark and Hadoop by editing the files inside the config folder.
- Specify the hostname or IP of the Spark master in the spark-env.sh script.
- Put the hostnames or IPs of the workers in the workers file.

Clone this repository, cd into it, and switch to a superuser shell.

$ git clone https://github.com/mjorrico/pyspark-bitcoin-reddit.git
$ cd pyspark-bitcoin-reddit
$ sudo bash

Build the image.
```
$ ./build.sh
```
Run the master on one VM.

Use the -f flag to format the namenode.
```
$ ./start-master.sh [-f]
```
Run the workers on multiple VMs.

The -h flag must be provided. Its value is the worker's hostname, which must also be present in /etc/hosts and in the workers file.
```
$ ./start-worker.sh -h [workers_hostname]
```
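The hostname requirement for the -h flag can be checked before starting a worker. The sketch below is an illustration under assumptions: it assumes the workers file lists one host per line, and the function names and sample file contents are hypothetical rather than taken from the repository.

```python
def hosts_file_names(hosts_text):
    """Return every hostname and alias listed in /etc/hosts-style text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        names.update(fields[1:])  # fields[0] is the IP address
    return names


def workers_file_names(workers_text):
    """Return the entries of a workers file (assumed: one host per line)."""
    names = set()
    for line in workers_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            names.add(line)
    return names


def missing_from(hostname, hosts_text, workers_text):
    """List the files in which the given worker hostname does not appear."""
    missing = []
    if hostname not in hosts_file_names(hosts_text):
        missing.append("/etc/hosts")
    if hostname not in workers_file_names(workers_text):
        missing.append("workers")
    return missing


# Hypothetical file contents, for illustration only.
example_hosts = """\
127.0.0.1   localhost
10.0.0.1    spark-master
10.0.0.2    spark-worker-1  # first worker
"""
example_workers = """\
spark-worker-1
spark-worker-2
"""

print(missing_from("spark-worker-1", example_hosts, example_workers))  # []
print(missing_from("spark-worker-2", example_hosts, example_workers))  # ['/etc/hosts']
```

In practice the two text arguments would be read from /etc/hosts and from the workers file in the config folder.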
The cluster should now be up and running.

The Spark UI and the HDFS GUI should be running on the master on ports 8080 and 9870, respectively.
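A quick way to confirm that both web interfaces are listening is to attempt a TCP connection to each port. This is a minimal sketch, not part of the repository; the function names are hypothetical, and a successful connection only shows that something is listening on the port, not that the UI itself is healthy.

```python
import socket


def is_port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, timed out, or hostname not resolvable
        return False


def check_master_uis(master_host):
    """Check the two ports named in this README on the given master host."""
    ports = {"Spark UI": 8080, "HDFS GUI": 9870}
    return {name: is_port_open(master_host, port) for name, port in ports.items()}
```

For example, `check_master_uis("spark-master")` (with the real master hostname substituted for this placeholder) should report `True` for both entries once the master is up.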
