
save_weights fails when large number of input features are present #95

Open
Guzzii opened this issue Jul 20, 2018 · 4 comments

Comments

@Guzzii
Contributor

Guzzii commented Jul 20, 2018

Hi @montanalow. This is really great work. I really like how you abstract away the common pitfalls in machine learning and streamline the process in this project. I see a lot of potential in it from a data scientist's perspective. If you don't mind, I can provide feedback from using this tool.

For this particular issue, I encountered an h5py error because of too many Input layers. As shown here, we have to pass one encoder for each column in the dataframe, and each encoder corresponds to one Input layer. I deal with a lot of DNA sequence data, which usually has >5000 columns. I think it makes sense to at least combine the columns that use Continuous or Pass encoders into one Input.
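For reference, here is a rough back-of-the-envelope sketch of why I think this overflows. I'm assuming the failure is HDF5's 64 KB object-header attribute limit, which Keras hits when `save_weights` writes all layer names into a single attribute; the name template below is hypothetical:

```python
# Rough sketch (assumption): Keras' save_weights stores every layer name in
# one HDF5 attribute, and HDF5 caps an object-header attribute at 64 KB.
# With one Input per column, thousands of columns overflow that limit.

HDF5_ATTR_LIMIT = 64 * 1024  # 64 KB object-header attribute limit

def layer_names_size(n_columns, name_template="input_sequence_col_{}"):
    """Approximate bytes needed to store one layer name per column."""
    names = [name_template.format(i) for i in range(n_columns)]
    return sum(len(name.encode("utf-8")) for name in names)

print(layer_names_size(100) < HDF5_ATTR_LIMIT)   # True: hundreds of inputs fit
print(layer_names_size(5000) < HDF5_ATTR_LIMIT)  # False: thousands do not
```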

@montanalow
Contributor

@Guzzii This is something we've run into internally as well. The current workaround is to set short_names = True, which will get you to hundreds, but probably not thousands, of inputs.

What if encoders that share a common base name followed by a number, e.g. 'sequence_1', 'sequence_2', 'sequence_3', ... 'sequence_n', were mapped into a single input named 'sequence' with shape (n,), for all types where that is possible?

@Guzzii
Contributor Author

Guzzii commented Jul 24, 2018

Hi @montanalow. I think it makes sense. Just want to make sure I understand correctly: in this case, it would aggregate columns with the shared base name sequence_col_{}, and encoder-generated inputs like one_hot_{}, respectively.

sequence_col_{} -> sequence (input_shape=n_1)
one_hot_{} -> one_hot (input_shape=n_2)
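Something like this sketch of the grouping step, where columns with a common base name and a numeric suffix collapse into one input whose shape is the group size (the function and column names here are my own, just for illustration):

```python
import re
from collections import Counter

# Hypothetical sketch of the proposed aggregation: columns sharing a base
# name followed by "_<number>" collapse into a single input, whose shape
# is the number of grouped columns. Unsuffixed columns stay standalone.
SUFFIX = re.compile(r"^(?P<base>.+)_\d+$")

def group_inputs(columns):
    """Map e.g. sequence_col_1..sequence_col_n -> {'sequence_col': n}."""
    grouped = Counter()
    for column in columns:
        match = SUFFIX.match(column)
        base = match.group("base") if match else column
        grouped[base] += 1
    return dict(grouped)

columns = (["sequence_col_%d" % i for i in range(1, 4)]
           + ["one_hot_%d" % i for i in range(1, 3)]
           + ["label"])
print(group_inputs(columns))
# → {'sequence_col': 3, 'one_hot': 2, 'label': 1}
```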

@montanalow
Contributor

Correct. I think there will be a little bit of complexity around encoders that have a sequence_length, like the Token encoder, because they will need to go to a 2D-shaped input, but it should still work in theory.
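In other words, the shape bookkeeping might look roughly like this (a sketch, not the actual implementation):

```python
# Hypothetical shape rule for grouped encoders: encoders with a
# sequence_length (like Token) get a 2D input of
# (n_columns, sequence_length); the rest stay 1D with (n_columns,).

def grouped_input_shape(n_columns, sequence_length=None):
    if sequence_length is not None:
        return (n_columns, sequence_length)
    return (n_columns,)

print(grouped_input_shape(5000))     # plain encoders -> (5000,)
print(grouped_input_shape(100, 10))  # Token-style encoders -> (100, 10)
```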

