ispras/flint

This system aims to partially replace Hadoop Mahout. It should be able to cope with datasets of around 1 TB in size and is to be implemented in the Scala programming language on top of the Spark framework.

The differences from Mahout:

  • Focus on textual data
  • Fast (faster than Mahout)
  • Easy to integrate
  • Easy to extend
  • A relatively small set of abilities

Abilities

Classification

  • Naive Bayes
  • SVM (possibly without kernels; they are usually not useful in high-dimensional spaces anyway)
  • Logistic regression
  • Random forest
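
As an illustration of the intended level of abstraction, here is a minimal sketch of training a Naive Bayes text classifier with Spark's MLlib API; flint's own classifier interface may look different, and the data, names and parameters below are purely hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object NaiveBayesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("nb-sketch"))

    // Toy corpus: each document is already reduced to a sparse bag-of-words
    // vector (word index -> count) plus a class label.
    val docs = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.sparse(5, Seq((0, 2.0), (3, 1.0)))),
      LabeledPoint(1.0, Vectors.sparse(5, Seq((1, 1.0), (4, 3.0))))
    ))

    // Train a multinomial Naive Bayes model and classify a new document.
    val model = NaiveBayes.train(docs, lambda = 1.0)
    val prediction = model.predict(Vectors.sparse(5, Seq((4, 1.0))))
    println(s"predicted class: $prediction")

    sc.stop()
  }
}
```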

Regression

  • Logistic regression
  • Linear regression
  • SVM regression (Optional)

Clustering and Topic Modelling

  • k-means
  • EM-RBF
  • PLSA (robust, pre-defined topics, EM inference)
  • LDA (EM inference)
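
Similarly, the clustering side can be pictured as plain k-means over document vectors. The snippet below is a hedged sketch using Spark MLlib's KMeans rather than flint's own API; the feature values and parameters are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch"))

    // Documents represented as dense feature vectors (e.g. tf-idf weights).
    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 0.2),
      Vectors.dense(0.9, 0.1, 0.0),
      Vectors.dense(0.0, 1.0, 0.8)
    )).cache()

    // Cluster the corpus into k = 2 groups with at most 20 iterations.
    val model = KMeans.train(data, k = 2, maxIterations = 20)
    println(s"cluster of first doc: ${model.predict(data.first())}")

    sc.stop()
  }
}
```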

Pipeline

  • Data processing. Converting the dataset stored on disk to an intermediate representation. For textual data, for instance, the text itself should be pre-processed (in a highly customizable way) and every word replaced with its index. In general, any object you are going to use (train on, test on, run on) should first be converted to the intermediate representation (similar to Instance in Weka). There should be a tool to save the intermediate representation to disk and to read it back; a sketch of this step is given after this list.

  • Model training. There should be a tool to save a trained model to disk and to read it back.

  • Model usage.

The first two steps are to be distributed.
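
As a rough picture of what the data processing step could look like (this is not flint's actual API; all paths and names here are hypothetical), the sketch below tokenizes raw text with Spark, builds a word-to-index dictionary, rewrites every document as a sequence of word indexes and saves the result to disk:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PreprocessSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("preprocess-sketch"))

    // Raw corpus: one document per line (placeholder path).
    val rawDocs = sc.textFile("hdfs:///corpus/raw")

    // The customizable pre-processing step: here just lower-casing and a crude tokenizer.
    val tokenized = rawDocs.map(_.toLowerCase.split("\\W+").filter(_.nonEmpty))

    // Build the word -> index dictionary over the whole corpus. For a 1 TB
    // dataset the dictionary may itself need to stay distributed; it is
    // collected to the driver here only to keep the sketch short.
    val dictionary = tokenized.flatMap(words => words.toSeq).distinct().zipWithIndex().collectAsMap()
    val dictBc = sc.broadcast(dictionary)

    // Intermediate representation: every word replaced with its index.
    val indexedDocs = tokenized.map { words =>
      words.flatMap(w => dictBc.value.get(w)) // drop words missing from the dictionary
    }

    // Persist the intermediate representation so training can reload it later.
    indexedDocs.map(_.mkString(" ")).saveAsTextFile("hdfs:///corpus/indexed")

    sc.stop()
  }
}
```

A trained model would be written to and read from disk in the same spirit, so that the distributed preprocessing and training stages can run independently of model usage.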

Nongoals

Once again: this system is not a full replacement for Hadoop Mahout. We are not going to provide three different implementations of LDA; we would rather have only PLSA (but easy to use, fast, with chess and courtesans).

About

Scalable machine learning framework
