pu4spark

A library for Positive-Unlabeled Learning for Apache Spark MLlib (ml package)

Implemented algorithms

Traditional PU

The original Positive-Unlabeled learning algorithm, first proposed in:

Liu, B., Dai, Y., Li, X. L., Lee, W. S., & Yu, P. S. (2002). Partially supervised classification of text documents. In ICML 2002, Proceedings of the Nineteenth International Conference on Machine Learning (pp. 387–394).

Gradual Reduction PU (aka PU-LEA)

A modified Positive-Unlabeled learning algorithm; the main idea is to gradually refine the set of positive examples. The pseudocode was taken from:

Fusilier, D. H., Montes-y-Gómez, M., Rosso, P., & Cabrera, R. G. (2015). Detecting positive and negative deceptive opinions using PU-learning. Information Processing & Management, 51(4), 433-443.

Requirements

Spark 1.5+

(Spark 2+ has not been tested, but it should work if you replace SparkContext with SparkSession and mllib.linalg.Vector with ml.linalg.Vector.)
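
As a rough illustration of the substitution described in the note above, the sketch below shows the Spark 2.x counterparts of the two types it names. It is untested against pu4spark itself. The object name, application name, and sample vector are hypothetical, and local mode is used only so the snippet can run on its own.

```scala
// Spark 1.x imports that the note above suggests replacing:
//   import org.apache.spark.SparkContext
//   import org.apache.spark.mllib.linalg.Vector
// Spark 2.x counterparts:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

object Spark2Sketch {
  def main(args: Array[String]): Unit = {
    // SparkSession is the Spark 2.x entry point; the underlying
    // SparkContext is still reachable through spark.sparkContext.
    val spark = SparkSession.builder()
      .appName("pu4spark-spark2-sketch") // hypothetical application name
      .master("local[*]")                // local mode, for illustration only
      .getOrCreate()

    // Feature vectors come from the ml package instead of mllib.
    val features: Vector = Vectors.dense(0.5, 1.0)
    println(features.size)

    spark.stop()
  }
}
```

Whether further changes are needed inside the library for Spark 2+ is not stated in this README, so treat this only as a starting point.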

Linking

The library is published to Maven Central and JCenter. Add the following lines, depending on your build system.

Gradle

compile 'ru.ispras:pu4spark:0.3'

Maven

<dependency>
    <groupId>ru.ispras</groupId>
    <artifactId>pu4spark</artifactId>
    <version>0.3</version>
</dependency>

SBT

libraryDependencies += "ru.ispras" % "pu4spark" % "0.3"

Building from Sources

Build the library with Gradle:

./gradlew jar

Usage example

// Package name assumed from the source path src/main/scala/ru/ispras/pu4spark
import ru.ispras.pu4spark._

val inputLabelName = "category"
val srcFeaturesName = "srcFeatures"
val outputLabel = "outputLabel"

val puLearnerConfig = TraditionalPULearnerConfig(0.05, 1, LogisticRegressionConfig())
val puLearner = puLearnerConfig.build()
val df = ... // input DataFrame; once prepared, it must contain at least:
// - a binary label column marking positive vs. unlabeled rows (inputLabelName)
// - a column with the features assembled as a vector (srcFeaturesName)

// preparedDf is df with its features assembled into the srcFeaturesName column
val weightedDF = puLearner.weight(preparedDf, inputLabelName, srcFeaturesName, outputLabel)

The returned DataFrame contains a probability estimate for each instance in the outputLabel column.
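
One way the result might be consumed is to keep only the rows whose estimated probability exceeds a cutoff, sorted from most to least likely. The sketch below assumes that the outputLabel column is numeric (a Double); the README only says it holds a probability estimate. The helper name likelyPositives and the default cutoff of 0.5 are hypothetical and not part of pu4spark.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of pu4spark.
// Keeps rows whose value in `probCol` is at least `cutoff`,
// ordered from the highest estimate to the lowest.
// Assumption: `probCol` is a numeric (Double) column.
def likelyPositives(weighted: DataFrame, probCol: String, cutoff: Double = 0.5): DataFrame =
  weighted
    .filter(col(probCol) >= cutoff)
    .orderBy(col(probCol).desc)
```

With the names from the usage example, this would be called as likelyPositives(weightedDF, outputLabel). A suitable cutoff depends on the data and the classifier, so 0.5 is only a placeholder.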

Features can be assembled into a single column with VectorAssembler:

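import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler

// rowName is assumed to name a non-feature column (e.g. a row id) defined elsewhere
val assembler = new VectorAssembler()
  .setInputCols(df.columns.filter(c => c != rowName && c != inputLabelName)) // keep only feature columns
  .setOutputCol(srcFeaturesName) // the column name later passed to puLearner.weight
val pipeline = new Pipeline().setStages(Array(assembler))
val preparedDf = pipeline.fit(df).transform(df)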
