SparkClusters

This repository contains custom implementations of K-Means clustering and Hierarchical Agglomerative Clustering on Apache Spark, written in Scala.
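For reference, the per-iteration logic of K-Means can be sketched in plain Scala. This is a hypothetical, non-distributed sketch (object and method names are illustrative, not taken from this repository); the actual implementation runs the same assign/update steps over Spark RDDs:

```scala
// Hypothetical K-Means sketch: assign each point to its nearest centroid,
// then recompute each centroid as the mean of its assigned points.
// The Spark version distributes these two steps across the cluster.
object KMeansSketch {
  type Point = Vector[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Index of the centroid closest to point p.
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => dist(p, centroids(i)))

  // Component-wise mean of a non-empty cluster.
  def mean(ps: Seq[Point]): Point = {
    val dim = ps.head.length
    Vector.tabulate(dim)(d => ps.map(_(d)).sum / ps.length)
  }

  // One full iteration: assignment step followed by update step.
  def step(points: Seq[Point], centroids: Seq[Point]): Seq[Point] =
    points.groupBy(p => nearest(p, centroids))
      .toSeq.sortBy(_._1)
      .map { case (_, cluster) => mean(cluster) }

  def run(points: Seq[Point], init: Seq[Point], iters: Int): Seq[Point] =
    (1 to iters).foldLeft(init)((cs, _) => step(points, cs))
}
```

On Spark, the assignment step maps over an RDD of points and the update step is a `reduceByKey` per centroid index, so each iteration is a single pass over the data.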

The clustering algorithms are run on the Million Song Dataset; the results, with visualizations and analysis, can be found in report.pdf.
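The hierarchical agglomerative side can be sketched the same way. The sketch below is hypothetical plain Scala using single linkage (the repository's actual linkage criterion and Spark parallelization may differ): start with each point as its own cluster and repeatedly merge the two closest clusters until k remain.

```scala
// Hypothetical agglomerative clustering sketch with single linkage:
// the distance between two clusters is the distance of their closest pair.
object HacSketch {
  type Point = Vector[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Single-linkage distance between two clusters.
  def linkage(c1: Seq[Point], c2: Seq[Point]): Double =
    (for (a <- c1; b <- c2) yield dist(a, b)).min

  // Merge the two closest clusters until only k clusters remain.
  def cluster(points: Seq[Point], k: Int): Seq[Seq[Point]] = {
    var clusters: Seq[Seq[Point]] = points.map(Seq(_))
    while (clusters.length > k) {
      val pairs = for {
        i <- clusters.indices
        j <- clusters.indices if i < j
      } yield (i, j)
      val (i, j) = pairs.minBy { case (a, b) => linkage(clusters(a), clusters(b)) }
      val merged = clusters(i) ++ clusters(j)
      clusters = clusters.zipWithIndex
        .collect { case (c, idx) if idx != i && idx != j => c } :+ merged
    }
    clusters
  }
}
```

The pairwise linkage computation dominates the cost, which is the part that benefits most from being distributed over Spark.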

Running The Project

Follow the instructions below to run the clustering algorithms on the Million Song Dataset and to build the report (with visualizations).

Requirements

Install Java 1.8

Download binaries for Scala 2.11.8

Download Spark 2.2.0

R Requirements

Install the following dependencies to avoid errors while generating the R Markdown report.

From your R console, execute the following commands:

install.packages("ggplot2")
install.packages("tm")
install.packages("wordcloud")
install.packages("formattable")

Makefile

Before you build the project, make sure you've pointed SCALA_HOME to your Scala 2.11.8 binaries directory and SPARK_HOME to your Spark 2.2.0 home directory.

To build the project:

make build

To run all the variants:

make run

If you'd like to change the input path or output path, run it like so:

make run INPUT_PATH=input/all/ OUTPUT_PATH=out/

To generate the report:

make report
