
High level model builder #22

Closed
tbreloff opened this issue Jun 30, 2015 · 14 comments

Comments

@tbreloff
Collaborator

Selecting the appropriate model for a given dataset can be a daunting task. It would be great to provide a helper framework (possibly a separate package, or even a business) that could help choose/set up the best model given high-level information about the dataset:

  • Dimensionality?
  • Time-varying/Ordered Data?
  • Expected multicollinearity? (we can check this ourselves, of course)
  • Classification? 2 classes or more than 2?
  • Regression? Nonlinear?
  • etc.

I've seen many graphics which essentially create a decision tree given lots of high-level information about a data problem, and point you at the right solution type (linear regression vs logistic regression vs dimensionality reduction vs SVM vs random forests vs ???). I think this is something that could be implemented alongside an ensemble framework which could choose lots of candidate models for you and drop/average/vote on the best predictions. In the online setting, ensembles could be relatively cheap, even for large datasets (especially if the online algorithm allows for parallel fitting).
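
As a strawman, the selector could start as a hand-written decision tree over dataset metadata. A minimal sketch in Julia (every name below is a hypothetical placeholder, not anything in OnlineStats):

```julia
# Strawman decision tree over dataset metadata; all names here are
# hypothetical placeholders, not anything in OnlineStats.
struct DataProfile
    ordered::Bool      # time-varying / ordered data?
    n_classes::Int     # 0 => regression target
    nonlinear::Bool
end

function suggest_model(p::DataProfile)
    p.ordered && return :online_sgd               # streaming / ordered data
    if p.n_classes == 0                           # regression
        return p.nonlinear ? :random_forest : :linear_regression
    elseif p.n_classes == 2                       # binary classification
        return :logistic_regression
    else                                          # multiclass
        return :random_forest
    end
end

suggest_model(DataProfile(false, 2, false))  # => :logistic_regression
```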

(It's conceivable that this could be a SaaS business in its own right... a high-level online data science platform built on top of OnlineStats and OnlineAI.)

@joshday
Owner

joshday commented Jun 30, 2015

This is an interesting idea. I don't think I've seen anything that tries to automate model selection like this. It would be an easy and powerful tool, especially since many online algorithms can be designed to be self-tuning. I'm intrigued. Let's talk more.

@Evizero

Evizero commented Sep 4, 2015

Hi! Just here to drop some links.

> I've seen many graphics which essentially create a decision tree given lots of high-level information about a data problem, and point you at the right solution type (linear regression vs logistic regression vs dimensionality reduction vs SVM vs random forests vs ???)

The most famous one is probably the one from scikit-learn.

> ... an ensemble framework which could choose lots of candidate models for you and drop/average/vote on the best predictions. In the online setting, ensembles could be relatively cheap, even for large datasets (especially if the online algorithm allows for parallel fitting)

There is an interesting reference Python implementation concerning automatic ensemble building.

@tbreloff
Collaborator Author

tbreloff commented Sep 4, 2015

Thanks for the links. I'm starting to work on ensembles in my package OnlineAI.jl, which extends OnlineStats. I'll certainly use this as a reference.


@Evizero

Evizero commented Sep 4, 2015

What is your position on callback functions? (This question actually goes to both of you, for OnlineAI and OnlineStats.) You two seem to be doing a very good job and also seem to be very active, so I would really love to use your work where it makes sense. I do have the design restriction that I require callback functions, ideally with support for early stopping. As far as I can tell, OnlineStats offers this if I use the low-level API with the update! methods.
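
Roughly the pattern I have in mind, as a sketch (update! is the OnlineStats method I mentioned; the other names are placeholders I made up):

```julia
# Sketch of the callback pattern; update! is the OnlineStats low-level
# method mentioned above, everything else is a placeholder.
function train!(model, X, y; every::Int = 100, callback = (m, i) -> false)
    for i in 1:length(y)
        update!(model, X[:, i], y[i])      # one observation at a time
        if i % every == 0 && callback(model, i)
            break                          # callback requested early stop
        end
    end
    return model
end

# Example: a closure that stops once validation loss stops improving.
function make_early_stopper(validation_loss)   # validation_loss: hypothetical
    best = Ref(Inf)
    (m, i) -> begin
        loss = validation_loss(m)
        stop = loss >= best[]
        best[] = min(best[], loss)
        stop
    end
end
```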

Background: I am working on a supervised learning front end (somewhat inspired by scikit-learn and caret, among others) where I also work on data abstractions for file-streaming / in-memory datasets in various forms. I am currently investigating which libraries to use as back ends for specific things. Deterministic optimization seems pretty much settled (pending some PRs/issues here and there) on Optim.jl for low-level access, plus Regression.jl. Where I am unsure is stochastic optimization. There is SGDOptim.jl, but it's not really being actively worked on (at least not visibly). I'm also considering Mocha.jl, but it comes with a lot of baggage. Your two projects seem very promising in that regard.

What are your thoughts on this?

@tbreloff
Collaborator Author

tbreloff commented Sep 4, 2015

You should look through the source in https://github.com/tbreloff/OnlineAI.jl/tree/master/src/nnet. I'm working on a bunch of things that you might be interested in, including various ways to split and sample static datasets, various stochastic gradient algorithms, and lots of cool (and easy to use!) neural net stuff: dropout, regularization, flexible cost functions and activations, and even a normalization technique that I haven't seen anywhere else, which I converted into an online algorithm (google "Batch Normalization"). In my opinion, it's much easier to use than something like Mocha.jl, and it opens up streaming and parallel algorithms for big datasets. Not to mention you can combine and leverage all of OnlineStats, including the cool "stream" macro I made.

As for your questions on callbacks: my thought is that the functionality of nnet/solver.jl will end up embedded in the update function, and things like early stopping could be accomplished by setting certain flags and occasionally triggering callbacks to check against a validation set. I'm still actively thinking through the design, and my goal is something that should cover your needs.
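
To give a flavor of the online normalization idea, a simplified sketch (not the actual OnlineAI code):

```julia
# Simplified sketch of the online normalization idea: track exponentially
# weighted running moments and standardize each incoming value with them.
mutable struct OnlineNorm
    μ::Float64
    σ²::Float64
    λ::Float64        # smoothing weight
end
OnlineNorm(λ = 0.01) = OnlineNorm(0.0, 1.0, λ)

function normalize!(o::OnlineNorm, x::Real)
    o.μ  = (1 - o.λ) * o.μ  + o.λ * x
    o.σ² = (1 - o.λ) * o.σ² + o.λ * (x - o.μ)^2
    return (x - o.μ) / sqrt(o.σ² + 1e-8)     # small ε keeps the division safe
end
```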


@Evizero

Evizero commented Sep 4, 2015

I am absolutely interested in the neural net stuff. I will look into the code in close detail.

Concerning callbacks: I do have some time before I get to include stochastic optimization, so don't feel rushed.

Something that troubles me at first glance: do I see correctly that you use the matrix rows to denote observations? I know this is the usual notation in textbooks, but as far as I know, in Julia using the columns to denote observations is better for performance because of the column-major array memory layout.
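
For reference, the layout point in a nutshell (illustrative snippet only):

```julia
# Julia arrays are column-major, so an observation stored as a column is
# contiguous in memory, while an observation stored as a row is strided.
X = randn(100, 10_000)    # 100 features × 10_000 observations (cols = obs)

obs_as_col(X, j) = X[:, j]    # contiguous read: fast
obs_as_row(X, i) = X[i, :]    # strided read: slower, cache-unfriendly

# Timing either access pattern over all observations (e.g. with
# BenchmarkTools.@btime) typically shows the column version winning.
```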

@tbreloff
Collaborator Author

tbreloff commented Sep 4, 2015

Yes, I think Josh and I were both more concerned with getting the code correct... I made the decision early on that I could live with the performance implications of row-based matrices. I'm holding out hope that we'll have performant row-based array storage in Julia at some point (even if I have to implement it myself), because no matter how hard I try I find column-based storage annoying to use.


@tbreloff
Collaborator Author

tbreloff commented Sep 4, 2015

Also, remember that you can update one point at a time by looping over the columns of a column-based matrix... you just lose the short helper function that does the loop for you.
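
Something like this (update! as in the OnlineStats low-level API; the model is a placeholder):

```julia
# Assuming observations are stored as columns of X and targets in y.
X, y = randn(3, 100), randn(100)
model = LinReg(3)                 # placeholder for any OnlineStats model
for j in 1:size(X, 2)
    update!(model, X[:, j], y[j])
end
# (X[:, j] copies; a view of the column avoids the allocation if needed.)
```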


@Evizero

Evizero commented Sep 4, 2015

> because no matter how hard I try I find column-based storage annoying to use

I absolutely agree on that.

However, it does kind of make it harder to interface with the library when using the column-based format (which I do), although looping through the columns should probably do the trick for me, as you just described.

I have seen the TransposeView{T}, which seems like a good way to pretend internally that the indexing is row-based. Maybe that could be a solution for getting the column-based performance without sacrificing code clarity. Or what is this type for?

@tbreloff
Collaborator Author

tbreloff commented Sep 4, 2015

TransposeView may work for this (or at least be the beginning of an implementation). I made it so that I could create "tied matrices" in stacked autoencoders... essentially, the weight matrix of one layer is the transpose of the weight matrix of a previous layer. This was straightforward since the layers share the same underlying matrix.
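
The core of the idea is tiny; roughly this (a simplified sketch, not the actual definition in OnlineAI):

```julia
# A lazy transposed view over a shared matrix, so tied layers mutate the
# same underlying storage with no copying.
struct TransposeView{T} <: AbstractMatrix{T}
    parent::Matrix{T}
end
Base.size(v::TransposeView) = reverse(size(v.parent))
Base.getindex(v::TransposeView, i::Int, j::Int) = v.parent[j, i]
Base.setindex!(v::TransposeView, x, i::Int, j::Int) = (v.parent[j, i] = x)

W  = randn(3, 5)         # encoder weights
Wt = TransposeView(W)    # decoder weights, tied to W
Wt[2, 1] = 0.0           # also mutates W[1, 2]
```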


@joshday
Owner

joshday commented Sep 5, 2015

I've been traveling... Tom seems to have your questions well covered, but I'll chime in here. I'd love to stay updated on what you're working on and what you'd like to see in OnlineStats. My next OnlineStats project is variance components models, but I'm happy to work on things people are actually using.

@joshday
Owner

joshday commented Sep 14, 2016

This is definitely JuliaML material.

@joshday joshday closed this as completed Sep 14, 2016
@tbreloff
Collaborator Author

tbreloff commented Sep 14, 2016

Oh man... I can't believe this was a year ago!

@joshday
Owner

joshday commented Sep 14, 2016

Is this essentially the birthplace for @tbreloff's vision of JuliaML? It's a part of history, now.
