Custom implementation of K-Means & Hierarchical Clustering on Apache Spark in Scala
SparkClusters

This repository contains a custom implementation of K-Means clustering and Hierarchical Agglomerative Clustering on Apache Spark, written in Scala.
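The heart of K-Means is the assign-and-update iteration (Lloyd's algorithm). As a rough illustration only, not the repository's actual code, one iteration can be sketched over plain Scala collections; the Spark version would express the same steps over RDDs with `map`/`reduceByKey`:

```scala
// Illustrative sketch of one Lloyd's iteration for K-Means.
// The repo's implementation distributes these steps over Spark RDDs;
// here plain Scala collections stand in for the distributed data.
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points.
  def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center nearest to point p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => dist2(p, centers(i)))

  // One iteration: assign each point to its nearest center, then
  // recompute each center as the mean of its assigned points.
  // (Centers that attract no points are dropped in this sketch.)
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] =
    points.groupBy(p => closest(p, centers)).toSeq.sortBy(_._1).map {
      case (_, members) =>
        val n = members.size.toDouble
        members.transpose.map(_.sum / n).toArray
    }

  def main(args: Array[String]): Unit = {
    val pts: Seq[Point] = Seq(Array(0.0, 0.0), Array(0.0, 1.0),
                              Array(9.0, 9.0), Array(10.0, 10.0))
    val updated = step(pts, Seq(Array(0.0, 0.0), Array(10.0, 10.0)))
    updated.foreach(c => println(c.mkString(",")))
  }
}
```

In the distributed setting, the assignment step is a `map` over the point RDD and the mean computation a `reduceByKey`, with the (small) set of centers broadcast to the executors each iteration.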

The clustering algorithms are run on the Million Song Dataset; the results, with visualizations and analysis, can be found in report.pdf.
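Hierarchical agglomerative clustering, in turn, starts with every point in its own cluster and repeatedly merges the two closest clusters until the desired number remains. A minimal sketch using single linkage (an assumption for illustration; the repository's linkage criterion and Spark parallelization may differ):

```scala
// Illustrative sketch of hierarchical agglomerative clustering with
// single linkage. Not the repo's actual code: the Spark version would
// parallelize the pairwise-distance computation.
object HacSketch {
  type Point = (Double, Double)

  def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Single-linkage distance: minimum distance over all cross pairs.
  def linkage(c1: Seq[Point], c2: Seq[Point]): Double =
    (for (p <- c1; q <- c2) yield dist(p, q)).min

  // Merge the two closest clusters until only k clusters remain.
  def cluster(points: Seq[Point], k: Int): Seq[Seq[Point]] = {
    var clusters = points.map(Seq(_))
    while (clusters.size > k) {
      val pairs = for {
        i <- clusters.indices
        j <- clusters.indices if i < j
      } yield (i, j)
      val (i, j) = pairs.minBy { case (a, b) => linkage(clusters(a), clusters(b)) }
      val merged = clusters(i) ++ clusters(j)
      clusters = clusters.zipWithIndex.collect {
        case (c, idx) if idx != i && idx != j => c
      } :+ merged
    }
    clusters
  }
}
```

This naive loop recomputes all pairwise linkages every merge, which is fine for a sketch but is exactly the cost a distributed implementation would need to amortize.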

Running The Project

Follow the instructions below to run the clustering algorithms on the Million Song Dataset and build the report (with visualizations).

Requirements

Install Java 1.8

Download binaries for Scala 2.11.8

Download Spark 2.2.0

R Requirements

Install the following dependencies to avoid errors while generating the R Markdown report.

From your R console, execute the following commands:

install.packages("ggplot2")
install.packages("tm")
install.packages("wordcloud")
install.packages("formattable")

Makefile

Before you build the project, make sure you've pointed SCALA_HOME to your Scala 2.11.8 binaries directory and SPARK_HOME to your Spark 2.2.0 home directory.

To build the project:

make build

To run all the variants:

make run

If you'd like to change the input path or output path, run it like so:

make run INPUT_PATH=input/all/ OUTPUT_PATH=out/

To generate the report:

make report