SparkTDA

A scalable topological data analysis package for Apache Spark. This project aims to implement the following features:

Scalable Mapper Implemented as Reeb Diagrams, i.e., Reeb Cosheaves
Scalable Mapper Implementation
Scalable Multiscale Mapper Implementation
Scalable Tower Computation for Multiscale Mapper
Scalable Persistent Homology Computation on Top of Apache Spark

To learn how to use the features above, or to read more about their implementation details, please follow the links.

Status

WIP and EXPERIMENTAL. This package is still a proof of concept of scalable topological data analysis support for Apache Spark, so it is not ready for production use.

Examples

Mapper

2-skeletons of Reeb Diagram of MNIST (40 intervals on the 1st principal component with 50% overlap)	2-skeletons of Reeb Diagram of MNIST (20 intervals on the 1st principal component with 50% overlap)
60k images clustered in 784 dimensions without any projection loss	60k images clustered in 784 dimensions without any projection loss
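
The captions above describe a cover of the filter range (here, the 1st principal component) by equal-length intervals with 50% overlap. The following is a minimal sketch of how such a cover could be built in plain Scala. It is an illustration only and does not reflect SparkTDA's actual API; the names `IntervalCover`, `Interval`, `cover`, and `assign` are hypothetical.

```scala
// Illustrative only: not SparkTDA's API. All names here are hypothetical.
object IntervalCover {

  /** A closed interval [lo, hi] on the filter axis. */
  final case class Interval(lo: Double, hi: Double) {
    def contains(x: Double): Boolean = x >= lo && x <= hi
  }

  /**
   * Builds `n` equal-length intervals covering [min, max] so that each pair of
   * consecutive intervals shares a fraction `overlap` (in [0, 1)) of its length.
   */
  def cover(min: Double, max: Double, n: Int, overlap: Double): Seq[Interval] = {
    require(n > 0 && max > min && overlap >= 0.0 && overlap < 1.0)
    // n intervals of length len, shifted by len * (1 - overlap) each, span
    // len * (1 + (n - 1) * (1 - overlap)), which must equal max - min.
    val len = (max - min) / (1.0 + (n - 1) * (1.0 - overlap))
    val step = len * (1.0 - overlap)
    (0 until n).map { i =>
      val lo = min + i * step
      Interval(lo, lo + len)
    }
  }

  /** Returns the indices of all intervals that contain the filter value `x`. */
  def assign(x: Double, intervals: Seq[Interval]): Seq[Int] =
    intervals.zipWithIndex.collect { case (iv, i) if iv.contains(x) => i }
}
```

For example, `IntervalCover.cover(0.0, 10.0, 4, 0.5)` yields the intervals [0, 4], [2, 6], [4, 8], and [6, 10], and a filter value of 3.0 is assigned to the first two. In Mapper, the points assigned to each interval are clustered separately, and clusters that share points across overlapping intervals are connected.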

Requirements

This library requires Spark 2.0+

Building and Running Unit Tests

To compile this project, run sbt package from the project home directory. This will also run the Scala unit tests. To run only the unit tests, run sbt test from the project home directory. This project uses the sbt-spark-package plugin, which provides the 'spPublish' and 'spPublishLocal' tasks. We recommend using this library with Apache Spark by supplying a comma-delimited list of Maven coordinates with --packages, so that Spark downloads the package and its dependencies from the local repository or the official Spark Packages repository.

The package can be published locally with:

$ sbt spPublishLocal

The package can be published to Spark Packages with (requires authentication and authorization):

$ sbt spPublish

Using with Spark Shell

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:

$ spark-shell --packages ognis1205:spark-tda:0.0.1-SNAPSHOT-spark2.2-s_2.11

Future Work

Mapper

Write Wiki
Implement Python APIs
Publish to Spark Packages
Benchmark
Consider using GraphFrames instead of plain GraphX
Implement some useful filter functions, e.g., Gaussian Density, Graph Laplacian, etc., as transformers

Related Software & Projects

References

Mapper

KNN/ANN/SNN

LSH

M. S. Charikar (2002). Similarity Estimation Techniques from Rounding Algorithms. 34th STOC.

