This repository contains custom implementation of K-Means clustering and Hierarchical Agglomerative Clustering on Apache Spark, written in Scala.
The clustering algorithms are run on Million Song Dataset and the results with visualization and analysis can be found in
Running The Project
You can run the pipeline to run the clustering algorithms on Million Song Dataset and building the report (with visualizations) by following the instructions below.
Download binaries for
Install following dependencies to avoid errors while generating the Rmarkdown Report.
From your R console execute following commands:
install.packages("ggplot2") install.packages("tm") install.packages("wordcloud") install.packages("formattable")
Before you build the project, make sure you've pointed
SCALA_HOME to your Scala 2.11.8 binaries home directory and
SPARK_HOME to Spark 2.2.0 home directory.
To build the project:
To run all the variants:
If you'd like to change the input path or output path, run it like so:
make run INPUT_PATH=input/all/ OUTPUT_PATH=out/
To generate the report: