read-dataset can be improved and optimised #370

Open
piokuc opened this issue Aug 29, 2017 · 11 comments

@piokuc (Contributor) commented Aug 29, 2017

Hello. Recently I tried to read a fairly large CSV file (~60 columns and ~3,000,000 rows) using read-dataset, and it turned out to be practically impossible on my laptop: it was so slow that I never had the patience to wait for the process to finish. It also consumed much more RAM than I would have expected (I had previously read the same file with Python's Pandas). If anybody is interested in the data, please see the Kaggle Zillow Price competition.

I examined read-dataset and the io.clj submodule and found the following issues.

  1. Every field of every row in the file is matched against a regular expression to test whether it is a Long. This is slow and in many cases unnecessary, because the column types are often known beforehand.

  2. There is no way to specify the desired column types; for example, I would often like to work with Float rather than Double in order to save memory.

  3. For large datasets with homogeneous data types (e.g. all columns are float), it is desirable to pack rows into unboxed vectors, which are much more compact than regular heterogeneous Clojure vectors.

I have implemented the improvements listed above. I decided to add some extra keyword parameters to read-dataset that trigger the optimised behaviour when present and preserve the old behaviour otherwise, to ensure compatibility with existing client code. The new keywords are:

  • :default-type (default nil) — the default type of all columns
  • :types (default nil) — a dictionary mapping a type to a column name or a list of column names, e.g. {Long ["foo" "bar"] Float "boo"}

Additionally, I added the following keywords and functionality, which are often useful when reading datasets encoded as CSV or similar formats (an example call combining these options is sketched after the list):

  • :transformers (default nil) — a dictionary mapping column names to functions that transform the strings in a given column before they are converted to their final types
  • :max-rows (default nil) — the maximum number of rows to read; nil means no limit
  • :rename-columns (default nil) — a dictionary mapping column names in the file to their names after loading
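For illustration only, a call combining these options might look like the sketch below; the file and column names are placeholders, and the exact keyword spellings are whatever ends up in the pull request.

```clojure
(require '[incanter.io :as io]
         '[clojure.string :as str])

;; Hypothetical usage of the proposed options; names are illustrative.
(def ds
  (io/read-dataset "data.csv"
                   :header true
                   :default-type Float                      ; unlisted columns parse as Float
                   :types {Long ["id"] Double ["logerror"]} ; per-column overrides
                   :transformers {"date" str/trim}          ; pre-parse string fix-ups
                   :max-rows 1000000                        ; stop after one million rows
                   :rename-columns {"id" "parcel-id"}))
```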

I will create a pull request in a minute. I would be very grateful for a discussion of the design and functionality of the read-dataset function, as well as of the code style; I am new to Clojure, so the code is probably not very idiomatic.

@joinr commented Aug 29, 2017

I'm looking at porting some of the stuff I wrote for exactly this issue, for another lib called SPORK.

https://github.com/joinr/spork/blob/master/src/spork/util/table.clj
https://github.com/joinr/spork/blob/master/src/spork/util/parsing.clj

I put this together before incanter took off, and well before the column-store version of dataset came about. The biggest problem is incanter's parse-string function; it's way too general and slow. If you have a schema, or can derive one from the first couple of lines of the data, you can parse an order of magnitude faster. My approach combines a flexible parsing schema (via spork.util.parsing) with functions that convert delimited text into spork.util.tables according to a schema. From there, the field parsers kick in, defaulting to something like parse-string if no parser is specified.
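To make the idea concrete, here is a minimal sketch of schema-driven parsing; the schema shape and helper below are illustrative, not spork.util.parsing's actual API:

```clojure
;; A schema is just a map from column name to parse function, so each
;; field is parsed directly instead of being probed with a regex first.
(def schema
  {"id"    #(Long/parseLong %)
   "price" #(Double/parseDouble %)
   "city"  identity})              ; leave text columns as strings

(defn parse-row
  "Parse one row (a vector of strings) with the per-column parse fns."
  [headers row]
  (mapv (fn [h v] ((get schema h identity) v)) headers row))

;; (parse-row ["id" "price" "city"] ["42" "3.5" "Warsaw"])
;; => [42 3.5 "Warsaw"]
```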

A second big problem is how bloated strings are, particularly when you have a small set of distinct values for a column/field. I got around this by using string pooling, in spork.util.string, which provides some fine-grained control over a string pool that re-uses string references if they already exist in the pool. Note: this is supposedly going to be addressed in Java 9, with some built-in string pooling facilities that "just work."
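A toy version of the pooling idea, just to show the principle (spork.util.string provides a more controllable implementation):

```clojure
;; Identical field values share a single String instance.
(defn string-pool []
  (let [pool (java.util.HashMap.)]
    (fn [^String s]
      (or (.get pool s)
          (do (.put pool s s) s)))))

;; Using the pool as the parser for a categorical column means that three
;; distinct values over millions of rows allocate only three Strings.
(def pooled-str (string-pool))
```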

Finally, I just recently extended the core.matrix dataset protocols to the spork.util.table.columntable type [in a separate application]. I'm looking at porting that implementation over to spork, but that'd require a core.matrix dependency just for a dataset protocol (kind of ungainly).

With the aforementioned solutions in place, I get a decent platform for processing mixed text/numeric data.

@mikera (Contributor) commented Aug 30, 2017

Can I request that we use a map of options rather than keyword parameters? This makes it much easier to pass around and merge maps of parameters.

@mikera (Contributor) commented Aug 30, 2017

My suggested strategy is as follows:

  • Make use of https://github.com/clojure/data.csv for reading CSV files. No need to reinvent any wheels here, surely? It's pretty fast and well supported, and it isn't a large dependency to add.
  • Datasets should be read into a core.matrix dataset. This gives immediate access to useful functions like emap-column for column transforms, for example. I think incanter already does this to some extent, but the exact behaviour may need some more testing (a rough sketch of this pipeline follows the list).
  • I like the idea of a parameterized read-dataset function that allows custom column transformers etc.
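A rough sketch of that pipeline, assuming ds/dataset-from-columns takes the column names plus a sequence of columns (as described later in this thread); it is eager and unoptimised, purely to show the shape:

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as jio]
         '[clojure.core.matrix.dataset :as ds])

;; Read CSV with data.csv, transpose rows into columns, and wrap the
;; result as a core.matrix dataset.
(defn csv->dataset [path]
  (with-open [r (jio/reader path)]
    (let [[header & rows] (csv/read-csv r)
          columns (apply mapv vector rows)]   ; transpose rows -> columns
      (ds/dataset-from-columns header columns))))
```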

@jaromil commented Aug 30, 2017

I agree with @mikera that, considering Incanter's current 2.0 design, the first step to improve speed should be to parse first into core.matrix, and then perhaps build the functionality proposed by @piokuc on top of that for the transformation into a dataset.

@joinr commented Aug 30, 2017

Decoupling parsing schemes (a map of field to parsing function, or default parsers like :text, :float, etc.) from reading delimited lines of text (data.csv) is extremely useful in practice. It allows callers to provide their own parsing functions or opt in to prepackaged defaults. Going a little further, you can infer parsing schemas for an unknown dataset by scanning some lines (R's data.table does this, as I recall), so callers can get a performant parsing scheme without explicitly asking for it.
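A minimal sketch of that kind of inference; the type-guessing rules are made up for illustration:

```clojure
;; Guess a parser per column from a handful of sample rows; columns whose
;; samples disagree fall back to plain text.
(defn guess-type [^String s]
  (cond (re-matches #"-?\d+" s)                     :long
        (re-matches #"-?\d*\.\d+([eE][+-]?\d+)?" s) :double
        :else                                       :text))

(def parsers
  {:long   #(Long/parseLong %)
   :double #(Double/parseDouble %)
   :text   identity})

(defn infer-schema [headers sample-rows]
  (zipmap headers
          (for [col (apply map vector sample-rows)]   ; transpose the sample
            (let [ts (distinct (map guess-type col))]
              (parsers (if (= 1 (count ts)) (first ts) :text))))))
```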

I think parsing to a matrix first (for speed) sidesteps the class of problems where there are many columns of text (categorical data, not the homogeneous numeric case alluded to earlier). String compression via pooling is easily accomplished with a parser that wraps a function like str with a check against references in a string pool. This can still be deferred to the caller (provide your own pooled parser if you need it...), but it's a common enough problem that I think it'd be useful to have pooling baked in, or provided as a type of field parser. Creating millions of non-canonical string instances that could be pooled (i.e. you have 3 distinct strings over millions of entries, but the JVM doesn't realize it) slows everything down due to allocation and hogs memory. Other platforms (R's data.table, for instance) kick incanter to the curb in speed and resource utilization, leading to the perception that incanter is less than useful for "real" applications.

Finally, why not extract read-dataset to a protocol fn as part of the dataset protocol? There may be custom binary formats that can be read far faster than delimited text. For that matter, wrapping the dataset in a smaller incanter-specific protocol would be great; extending the core.matrix dataset protocols to a custom data structure (like the optionally primitive-typed column store from spork.util.table) required quite a bit of matrix protocol wrangling, much of which seemed redundant. There may even be use cases where the original incanter row-store dataset has optimal access patterns, or where the user wants to provide a custom implementation.
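Something along these lines, i.e. a small incanter-facing protocol that different sources (delimited text, binary formats, Spark, ...) could implement; the naming here is entirely hypothetical:

```clojure
;; Hypothetical protocol sketch, not an existing incanter API.
(defprotocol PDatasetSource
  (read-dataset* [source opts]
    "Read `source` into something satisfying the dataset abstraction,
     using whatever source-specific optimisations apply."))

;; A file path could dispatch to a schema-aware CSV reader, while a custom
;; binary format could implement the same protocol directly.
```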

@jaromil commented Aug 30, 2017

FWIW, after reading @joinr's arguments I am convinced; thanks for taking the time! I recommend merging PR #371 and setting the future goal of extracting read-dataset into a protocol fn as part of the dataset protocol, allowing more optimised protocol implementations. This hook should come before processing with core.matrix. Meanwhile, the proposed PR seems to me a pretty sane and non-intrusive way to improve resource usage and speed.

@mikera (Contributor) commented Aug 30, 2017

Some thoughts:

  • There are potentially multiple input sources of data. data.csv is just one way, but you can usually expect data to come in as something like a sequence of rows. A Spark DataFrame would do something similar, for example. Or you could consume an infinite sequence of randomly generated rows.
  • You ultimately need to choose a data structure to store incremental results while you read. Conjing into Clojure vectors for each column (possibly using transients) is probably a sensible default option.
  • Being able to load lazily is crucial for big datasets (allowing stuff like string compression etc.). So ideally you want to do the transformations / parsing as you read in each row.
  • If you conj into columns, constructing a core.matrix dataset at the end is essentially free: by calling `ds/dataset-from-columns` you are effectively just initialising a wrapper containing the references to the columns and column names (a sketch of this approach is below).
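A sketch of that strategy, assuming ds/dataset-from-columns behaves as described in the last bullet (column names plus a sequence of columns):

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as jio]
         '[clojure.core.matrix.dataset :as ds])

;; Parse each row as it is read, conj!-ing fields into per-column transient
;; vectors, then wrap the persistent columns in a dataset at the end.
;; `parse-fns` is one parse function per column, in column order.
(defn read-columns [path parse-fns]
  (with-open [r (jio/reader path)]
    (let [[header & rows] (csv/read-csv r)
          empty-cols (vec (repeatedly (count header) #(transient [])))
          cols (reduce (fn [cols row]
                         (mapv (fn [col f v] (conj! col (f v))) cols parse-fns row))
                       empty-cols
                       rows)]
      (ds/dataset-from-columns header (mapv persistent! cols)))))
```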

@piokuc (Contributor, Author) commented Aug 30, 2017

Thanks a lot for all the comments, lots of good ideas here. I am a bit busy right now but I should have more time next week and I will be happy to work on implementing the suggested changes and features.

I think the use case that I mentioned, namely loading all-numeric, homogeneous columns, is an important one, but I am certainly interested in something generic, allowing for heterogeneous columns, like Pandas' DataFrame or R's data.table. Ultimately, I'd like to try to work towards a coherent, efficient yet flexible Clojure ecosystem for processing data, so I want to study core.matrix and SPORK, as well as Spark DataFrame, and see if they can all work together.

I agree with @joinr that loading data is the first thing someone trying to use Incanter does, and those with R or Pandas experience may be less than impressed with Incanter in its current state. That's exactly what I've been trying to improve, in a non-breaking, backwards-compatible manner. Therefore, I'd be happy to have this PR merged first, as @jaromil suggested, and then carry on working on the ideas proposed by @joinr and @mikera. Ultimately I want to be able to use Clojure to do the things I can do with Pandas and more, for example make use of Spark DataFrame and other powerful tools available on the JVM these days.

@mikera (Contributor) commented Aug 30, 2017

This should certainly work with heterogeneous columns! I think that is a definite requirement.

Note that core.matrix datasets are specifically designed to handle the heterogeneous case; you can even mix columns using different core.matrix implementations.

@joinr commented Aug 30, 2017

For what it's worth, in spork.util.table, if a schema is provided that has primitive types, for those columns I build primitive transient core.rrb-vectors of the corresponding type and build the dataset incrementally. Why rrb-vector? At the time (early last year?) primitive vectors in Clojure core didn't support transients, and rrb-vector served as a drop-in replacement. That may no longer be the case, though.
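A minimal sketch of that trick, assuming clojure.core.rrb-vector's vector-of and its transient support behave as advertised:

```clojure
(require '[clojure.core.rrb-vector :as fv])

;; Build a primitive double column incrementally through a transient RRB vector.
(defn double-column [xs]
  (persistent!
    (reduce (fn [col x] (conj! col (double x)))
            (transient (fv/vector-of :double))
            xs)))

;; (double-column [1 2.5 3]) => [1.0 2.5 3.0], backed by primitive storage
```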

The Spark DataFrame is an interesting point. It's a different implementation that we should be able to wrap and use incanter's (currently core.matrix) dataset protocol(s) against. I don't have a use case (not using Spark), but that could be an excellent mutable(?) implementation of an incanter-compatible dataset that could be wrapped by interested parties...

example of extending dataset protocols to spork.util.table.column-table.

@joinr commented Aug 30, 2017

One alibi regarding multiple row-like input sources (seqs/reducibles of lines are but one): having the ability to build datasets from columns or from seqs of records is very useful (core.matrix and incanter datasets cover this pretty well).

I think @piokuc has a flexible approach in allowing the caller to provide optional functions that the dataset reader (read-dataset) composes (with sane defaults). @mikera is right about not jamming too many optional args into the function, though. Case in point: renaming columns is a trivial operation that can be composed after building the initial dataset. The data science with Clojure book uses multimethods to neatly encapsulate these little intermediate cleaning operations; I typically have constructor fns that serve the same purpose. I'd vote to separate that concern (renaming columns) unless new users are reaching for it a lot.
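For example, renaming can stay a one-liner composed after the load rather than an option baked into read-dataset; a sketch assuming incanter.core/rename-cols takes the rename map followed by the dataset:

```clojure
(require '[incanter.core :as i])

;; Build (or load) the dataset first, then rename as a separate step.
(def raw  (i/dataset ["parcelid" "logerror"] [[1 0.1] [2 -0.05]]))
(def tidy (i/rename-cols {"parcelid" "parcel-id"} raw))
```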
