Adds Quantile Discretizer #90

Open: wants to merge 9 commits into base: master

@Helw150 Helw150 commented Sep 21, 2019

Description of Changes

Adds Quantile Discretization for NumericalFeatures

Includes
  • Code changes
  • Tests
  • Documentation

Helw150 commented Sep 21, 2019

I initially made this using IntVector instead of RealVector for the numQuantiles, but kept getting odd and opaque compiler errors of the following type:

[error]   last tree to typer: Select(Select(Select(Ident(breeze), linalg), DenseVector), fill$mIc$sp)
[error]        tree position: line 44 of /Users/will/oss/doddle-model/src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
[error]             tree tpe: (size: Int, v: () => Int, implicit evidence$5: scala.reflect.ClassTag[Int])breeze.linalg.DenseVector[Int]
[error]               symbol: method fill$mIc$sp in object DenseVector
[error]    symbol definition: def fill$mIc$sp(size: Int, v: () => Int, implicit evidence$5: scala.reflect.ClassTag[Int]): breeze.linalg.DenseVector[Int] (a MethodSymbol)
[error]       symbol package: breeze.linalg
[error]        symbol owners: method fill$mIc$sp -> object DenseVector
[error]            call site: method splitEvenly in object QuantileDiscretizer in package preprocessing

@inejc inejc added this to Review in progress in Kanban Board for doddle-model Sep 21, 2019
@inejc (Member) left a comment

Hey @Helw150, thanks for opening the PR, this is pretty awesome 🙂. I wrote a couple of comments and suggested a change for the splitEvenly function; let me know what you think about that. Regarding your comment about using IntVector for numQuantiles, I'll let you know once I look at it more thoroughly. My first guess would be to use DenseVector[Int] instead of IntVector, as the latter is just a type alias in doddle-model and might confuse breeze.

@inejc inejc added the enhancement New feature or request label Sep 21, 2019

Helw150 commented Sep 21, 2019

@inejc I think the new version should address all of your comments! Just changing to DenseVector[Int] still had me running into that issue, but I don't have any particular insights that would allow me to debug so lmk if you can figure anything out.

@picnicml picnicml deleted a comment Sep 21, 2019
@picnicml picnicml deleted a comment Sep 21, 2019
@inejc (Member) left a comment

I wrote some suggestions but this is definitely in the right direction! Let me know if you disagree with my comments.

@picnicml picnicml deleted a comment Sep 26, 2019

Helw150 commented Sep 26, 2019

@inejc Resolved all but one, let me know if it is a sticking point and I will change.

@inejc (Member) left a comment

I believe we are one step away from merging this. Regarding using IntVector for bucketCounts: I simply changed all : DenseVector[Double] to : IntVector, removed all internal DenseVector[Double] types, and removed the unnecessary .toInt and .toDouble calls, and compilation was successful.

import io.picnicml.doddlemodel.typeclasses.Transformer
import scala.Double.{MaxValue, MinValue}

case class QuantileDiscretizer(
Member:

I like this formatting. Is this scalafmt? We need to add a formatter to the project 😅.

Author: @Helw150, Sep 26, 2019

Yeah, I ran Scalafmt (but didn't PR my setup of it since I didn't know if you wanted it). It's a one line addition to the plugins file and a configurable settings file. I can make a PR with the setup and you can tune the config to your personal preferences!

Member:

I added this issue: #95.


/** Create a quantile discretizer which splits data into discrete evenly sized buckets.
*
* @param bucketCount The number of quantiles desired
Member:

I'm nitpicking here, but I wouldn't describe bucketCount as "The number of quantiles" because the number of quantiles is always one less than the number of buckets. From Wikipedia: "Quartiles: the three points that divide the data set into four equal groups in descriptive statistics."
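To make the buckets-versus-quantiles distinction concrete, here is a minimal sketch in plain Scala. It is not code from this PR: `cutPointFractions` is a hypothetical helper, and it only computes the fractions at which the interior cut points would fall.

```scala
// Hypothetical helper, not part of doddle-model: the fractions at which the
// interior cut points (quantiles) fall when splitting data into bucketCount buckets.
def cutPointFractions(bucketCount: Int): Seq[Double] = {
  require(bucketCount >= 1, "bucketCount must be positive")
  // 1 until bucketCount has bucketCount - 1 elements, one per cut point
  (1 until bucketCount).map(_.toDouble / bucketCount)
}

// Four buckets (quartiles) need only three cut points.
val quartileFractions = cutPointFractions(4)
println(quartileFractions)
```

With a single bucket the helper returns an empty sequence, which matches the "one less than the number of buckets" rule.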


private def computeQuantiles(target: Seq[Double], bucketCount: Int): Seq[(Double, Double)] = {
val binPercentileWidth = 1.0 / bucketCount
val targetArray = target.toArray
Member:

The reason I wrote the comment about target: Seq[Double] is that as a result, we copy each numerical column twice, instead of just once. The first time it is copied in def fit with x(::, colIndex).toScalaVector and the second time in def computeQuantiles with target.toArray.

The solution would be to change target: Seq[Double] to target: Array[Double] here and then create an array in def fit directly with .toArray which also makes a copy based on this.

Hope this makes sense and I'm not making a mistake reading this.

Author:

Got it! I didn't understand the comment at first, but it makes sense now.

.map(_.toDouble)
.map(DescriptiveStats.percentileInPlace(targetArray, _))
.sliding(2)
.map({case Seq(lowerBound, upperBound) => (lowerBound, upperBound)})
Member:

This line can be just .map { case Seq(lowerBound, upperBound) => (lowerBound, upperBound) }.
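For reference, a small self-contained sketch of the suggested brace form, using plain Scala collections. The bounds below are made-up illustration values (padding with Double.MinValue and Double.MaxValue is an assumption, not necessarily what the PR does).

```scala
// Hypothetical bounds for illustration only.
val exampleBounds: Seq[Double] = Seq(Double.MinValue, 1.0, 2.5, Double.MaxValue)

// sliding(2) yields consecutive two-element windows; the pattern-matching
// function literal needs only braces, no extra parentheses around it.
val exampleBuckets: List[(Double, Double)] =
  exampleBounds
    .sliding(2)
    .map { case Seq(lowerBound, upperBound) => (lowerBound, upperBound) }
    .toList

println(exampleBuckets)
```

Both `.map({ case ... })` and `.map { case ... }` compile to the same thing; the brace-only form is simply the more common Scala style.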

Author:

Oop yeah, I always use parens but it's personal preference


override protected def transformSafe(model: QuantileDiscretizer, x: Features): Features = {
val xCopy = x.copy
model.featureIndex.numerical.columnIndices.zipWithIndex.foreach {
Member:

I like this, it's elegant. I personally would format it as (just a subjective preference):

model.featureIndex.numerical.columnIndices.zipWithIndex.foreach { case (colIndex, bucketsIndex) =>
  val buckets = model.quantiles.getOrBreak(bucketsIndex)
  (0 until xCopy.rows).foreach { rowIndex =>
    xCopy(rowIndex, colIndex) = buckets.indexWhere { case (lowerBound, upperBound) =>
      lowerBound <= xCopy(rowIndex, colIndex) && xCopy(rowIndex, colIndex) <= upperBound
    }.toDouble
  }
}
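The bucket lookup in the snippet above can be tried in isolation. The following is a sketch in plain Scala with made-up buckets; `bucketIndexOf` and `sampleBuckets` are hypothetical names, not part of the PR.

```scala
// Hypothetical (lowerBound, upperBound) pairs; adjacent buckets share a bound.
val sampleBuckets: Seq[(Double, Double)] =
  Seq((Double.MinValue, 1.0), (1.0, 2.5), (2.5, Double.MaxValue))

// indexWhere returns the index of the FIRST matching bucket, so a value that
// sits exactly on a shared bound lands in the lower bucket. It returns -1
// when no bucket matches (for example for NaN).
def bucketIndexOf(value: Double): Double =
  sampleBuckets.indexWhere { case (lowerBound, upperBound) =>
    lowerBound <= value && value <= upperBound
  }.toDouble

println(Seq(0.0, 1.0, 2.0, 3.0).map(bucketIndexOf))
```

Because both comparisons are inclusive, the tie-breaking on shared bounds comes entirely from indexWhere stopping at the first match.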

import io.picnicml.doddlemodel.data.Feature.FeatureIndex
import io.picnicml.doddlemodel.data.Features
import io.picnicml.doddlemodel.syntax.OptionSyntax._
import io.picnicml.doddlemodel.typeclasses.Transformer
Member:

Please add another blank line between the penultimate import and import scala.Double.{MaxValue, MinValue} (I used Optimize imports in IntelliJ, which also reordered some of the imports).

Author:

I use Emacs sadly, but I'll open IntelliJ for the import optimization (I have used Scalafix for similar things, but it's a big dependency for something IntelliJ does for free for most folks)

import scala.Double.{MaxValue, MinValue}

case class QuantileDiscretizer(
private val bucketCounts: DenseVector[Double],
Member:

If I understood correctly, you implemented this with IntVector initially and the tests failed (didn't compile)?

I tried using IntVector here and changing the DenseVector[Double]s to DenseVector[Int]s throughout the code/tests and the tests passed on all three supported versions of Scala for this project (2.11.12, 2.12.9 and 2.13.0). I got some mysterious error once that I can't reproduce anymore, but it went away after deleting the target/ folder and building again.

Could you please check again, deleting target/ if the problem persists? I'm not sure where the problem could be, as I'm assuming you are using the dependency versions listed in project/Dependencies.

Labels: awaits review, enhancement (New feature or request)
Projects: Kanban Board for doddle-model (Review in progress)
3 participants