What is this repository for? #61
Hi @rcurtin, I think PR #60 addresses some of these issues, and hopefully the rest soon. New directory:
In depth changes:
I would love to get your opinion; currently the PR isn't as refined as I think it should be, so I am constantly working on improving it. If you get a chance, kindly take a look. I know a PR with that many changes isn't easy to review, so I would really appreciate it if you could leave short notes/changes/suggestions. Future work:
Relevant tickets: Well, I think we can officially say that this is a long message.
Also, here is another comment that I made earlier in a discussion in #57:
Full discussion can be found in #57.
In my opinion, this is what this repository should be. We did have a short discussion about this in #57.
A couple of months ago, @zoq and I discussed the possibility of using Jupyter notebooks for smaller programs and tutorials. Do you think we could employ Jupyter notebooks to handle this class of users? I haven't looked far enough into it to create a concrete plan, but I think it's a good idea nonetheless.
The transfer learning approach: download the model with saved weights and just plug in your own dataset, maybe with an option to change the input and output. I like the idea, @rcurtin. @sreenikSS's work may come in handy here, since the converter could convert many models, so we may not have to write them explicitly.
@kartikdutt18 thanks for the response. I have no issue with long responses, if you haven't guessed. 😄 I think the proposal makes sense and would be a nice improvement to the repository. Any comments I would have are pretty minor (like, e.g., would CMake be the best way for users to specify that weights should be downloaded? Or should it be some other utility or way?) and personally I think that those individual comments can be handled later. So I would vote for @kartikdutt18's proposal for the overall structure.
@favre49 yeah, definitely, Jupyter notebooks would be nice. We can actually do those pretty easily with Python bindings, but if we want native C++ notebooks, we would have to use xeus-cling, which I believe still suffers from the multiple definitions issue (jupyter-xeus/xeus-cling#91). I think that would be a really effective way to do examples though. (Bonus: they're notebooks, so it's easy to test whether or not they run and therefore test our documentation!)
Definitely, we might be able to take existing TensorFlow model weights and just convert them directly for mlpack and store them on our servers for easy download. Overall I think we should leave this issue open for a little while longer to gather comments and opinions before deciding. :)
Great, I have a PR open for this, which will be a bit tedious to review. Currently I have implemented the download using CMake, but I think the best way would be to download them at runtime if the option for using weights/datasets is set.
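For reference, the runtime-download idea could look something like this minimal sketch (the function name, parameters, and caching scheme are all hypothetical, not what the PR actually implements): fetch the weights only when the option is set and the file is not already cached.

```python
import os
import urllib.request

def ensure_weights(url, path, use_weights=True):
    """Download pretrained weights to `path` only when `use_weights` is set,
    and only if the file is not already cached locally."""
    if not use_weights:
        return None
    if not os.path.exists(path):
        # Fetch once; later runs reuse the cached file.
        urllib.request.urlretrieve(url, path)
    return path
```

This keeps the build system out of the picture entirely: nothing is downloaded at configure time, and users who never enable pretrained weights pay no cost.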
That would be great to add. Hopefully next week I'll add grouped convolutions to convolutional networks (I already have a PR open; it needs some debugging and rebasing); then we could also add weights for convolutional layers, so we would end up with a whole new domain of models whose weights we could use.
Thanks @rcurtin for opening the issue, and thanks for splitting the users into groups. Personally, I would focus on "Folks who want to read the code here, understand it, and copy-paste it into their own applications." So instead of trying to provide an executable that supports different datasets and tasks, have a single file that implements one concrete problem, like classification of the CIFAR-10 dataset using convolutional neural networks; another example could be collaborative filtering applied to the Netflix dataset.

The discussion right now seems to be neural-network centric, but mlpack implements other machine learning methods as well. I think the most common user has a specific problem and just searches for a solution; such a user will just copy the code and integrate it into their pipeline, so they don't necessarily need a method to load a dataset (not saying we don't need that, because it would make things easier) or an interface to set up the method. I guess a simple main method that has all the necessary steps would probably do the job. The idea is really similar to what notebooks have become.

This is just my opinion; I don't want anyone to feel like I'm forcing an idea here. In the end, this should be a community decision, and we can always open another repository that focuses on the other group.
@zoq, definitely, we have drifted towards ANNs a little bit, and I agree we should have an examples section which would handle the first group of users. Now we have to decide whether we should have it in this repo or a separate repo; personally, I would not mix the two.
Hey @zoq, @jeffin143, I think that something like this tutorial would make sense. It would have only a main function, and we would import the model and data from the rest of the repo. I think there are the following ways to handle this:
The examples won't contain a CLI or anything special; they would simply boil down to: For the second type of users, who are familiar with mlpack and ANNs (in general) and want ready-to-train/inference models for their data, I think having a folder that solves their problem shouldn't be very hard. This code would be similar to that in the tutorial, with the difference of a CLI and generalized code. Thanks.
Just throwing in my $0.02 here. I agree with @zoq that "Folks who want to read the code here, understand it, and copy-paste it into their own applications." would be the more common group of users and the ones to focus on. It makes sense to me to move the "one-shot" tutorial files currently here to a separate repository. Then this repo would be a collection of popular NN models to use in custom applications.
@kartikdutt18 put it much better than me and I think this should be what the models repo is for. |
Agreed, I don't think it makes sense to mix them, ideally, we can keep the repository as simple as possible. We learned a lot when we outsourced the optimization framework into its own repository (mlpack/ensmallen), from a user perspective you don't have to search through some folders that you don't need or aren't interested in.
I don't necessarily have tutorials in mind; writing a good tutorial takes time, and often you don't really want to read a tutorial, you just want the code. So what I had in mind is some kind of sink where I can find a bunch of "real-world" use cases. If they are documented well and could be seen as a tutorial, that is a plus.
Sounds like a good plan to me. I wouldn't name it tutorials (see my comment above), but that's just a minor detail. Also, we should make sure this isn't going to be too complex. https://github.com/mlpack/models/blob/master/Kaggle/DigitRecognizerCNN/src/DigitRecognizerCNN.cpp is already pretty much what I have in mind; I don't necessarily want to link against the models repo just to include a certain model if I could have that directly in my single-file example.
The problem I see with this approach is dependencies: for some examples it would be nice to have OpenCV, or something else that isn't needed to build/use mlpack but is needed for the example.
Hmm, I hadn't considered that. However, I'm still of the opinion that...
Ahh, I think I finally understand. So the goal for that repo would be to have code for some trivial problems (in a tutorial-like manner), so a user could understand how to use different methods (layers / machine learning algorithms) in mlpack on some problems to help them get started.
Got it. I guess I finally understand what the tutorial repo (of course a better name would be chosen) should look like. I guess having a tutorial README would be a good idea; it might take time, but I think that'll be really great.
Agreed, we might be using some tools that we don't need in mlpack; I think this would also allow us to add a few more dependencies for certain portions of the code. So I had a couple of questions about my restructuring PR:
Thanks a lot for this discussion; I really learned a lot.
The problem I had with this vs. having an examples folder in this repo is:
@favre49, when I first came across the models repo in mlpack, I thought it had saved models which I could preload and work upon. But then I found that it actually has examples of how mlpack works and the functionality it provides, like CNN code which simply creates a CNN architecture. So my current understanding of the models repo is something like the following, where the models would be predefined and you can just import one, create an instance, and work on it:

```python
# Imports the MobileNet model and discards the last 1000-neuron layer.
base_model = MobileNet(weights='imagenet', include_top=False)

x = base_model.output
x = GlobalAveragePooling2D()(x)
preds = Dense(4, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=preds)

tbCallback = keras.callbacks.TensorBoard(log_dir='/tmp/keras_logs', write_graph=True)
model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=15, shuffle=True, verbose=1,
                    validation_split=0.15, callbacks=[tbCallback])
```

And the examples folder would be more about how you can create a layer, plug in your code, manipulate the hyperparameters, and solve some basic classification problem as an example. This would not use the import-model feature; it is the bare minimum and does not depend on anything, just like the present approach:

```python
model = models.Sequential()
model.add(Dense((1), input_shape=(32, 32, 1)))
model.add(layers.Flatten())
model.add(Dense(100 * 100 * 3))
model.add(Reshape((100, 100, 3)))
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(9, activation='softmax'))
model.summary()

tbCallback = keras.callbacks.TensorBoard(log_dir='/tmp/keras_logs', write_graph=True)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=50, shuffle=True, verbose=1,
                    validation_split=0.3, callbacks=[tbCallback])
```

In the first example you don't have to create a MobileNet architecture, since it is already prebuilt; but if you want to create one, you can refer to the examples folder to see how, and what all you can use to build it.
Exactly. We can use a static website builder and create a website or page with that documentation or those examples. The CNN architecture is currently just in the models repo, but we could even use a static builder to produce a website, some kind of blog, or just code snippets. The whole point is that the weights have some size; I have seen big architectures whose weights are close to 100 MB, and if we are going for 7 or 8 of them, then someone who just wants to see examples, and not use all these models, still has to clone a potentially big repo. It's just like the mlpack main repo: if someone wants to build the Python bindings, they have to clone the main repo and then carry on.
We have a repo for that; I guess it would fit there. Then maybe write a blog post about the speed optimization and other things and link it to the repo.
We would have to adapt the strategy used in mlpack to generate bindings, then. @zoq, @rcurtin, I would like your take on this.
Before that, we should port it to the other repo, if everyone votes for it. I guess we should definitely come to a conclusion, since there are a lot of plans in hand and coming up, and finalizing one is difficult.
Hmm, I think by benchmarking I meant that the user would pass their dataset (or datasets that we support) and a set of models, and we would return the best possible model for fixed (tunable) epochs and hyperparameters. And a standard dataset would allow me to just pass, say, ... Also, weights and datasets will be downloaded only as needed. Currently I have the datasets in a separate repo, from where I download them in my PR.
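To make the benchmarking idea concrete, here's a minimal sketch of what such a helper could look like (the `train`/`evaluate` interface and every name here are hypothetical, not mlpack's actual API): train each candidate for a fixed number of epochs and keep the best scorer.

```python
# Hypothetical sketch: train each candidate model for a fixed number of
# epochs on the given dataset and return the one with the best score.
def benchmark(models, dataset, epochs=15):
    """Return (name, score) of the best-scoring model in `models`."""
    best_name, best_score = None, float("-inf")
    for name, model in models.items():
        model.train(dataset, epochs=epochs)   # hypothetical interface
        score = model.evaluate(dataset)       # e.g. validation accuracy
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

The epoch count and hyperparameters would stay tunable arguments, as described above, so the utility stays a thin loop over whatever models the user selects.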
There are a lot of great proposals here. I agree with @rcurtin, as there are a lot of folks who want to read the code quickly, understand it, and copy-paste it directly into their own code.
Awesome, great to see so many responses. 😄 I think we all agree that we definitely have two different kinds of users, and it seems to me like it would be worthwhile to treat those users separately---everyone seems to agree (mostly?) that there should be two different repositories for these purposes. There will be some details to work out for sure, too; luckily we have tons of suggestions. 🎉 @shrit's suggestion of git submodules could be really nice, and there are other nice detail suggestions all over the thread. I think if we focus on the highest-level question first, then all the implementation details will fall into place easily. :) So, to that end, it seems like there's really only one question: do we choose to keep just barebones simple examples in this repository and rename it to...
What do we think? If we can get agreement on that question, then working out the details should be a bit easier. I think I captured the range of opinions in just those two options; let me know if I missed one. 😄
I really like the git submodules idea, I think we should go for that. |
I think the first option is better as well. Its only con (?) would be an increasing number of repos, but that is much better than the dependencies problem of the second option. |
I agree with the general idea, but would it make more sense to create a new repo for the examples and move the code we already have over there, instead of renaming this one and creating a new repo? I think most of the current PRs address the second type of users.
I'm a little late. Some really great ideas here. Since we are en route to a final decision, I think rounding up the details from the perspective of a new user would be a good idea. That being said, let's assume I am a new user of mlpack, but I have knowledge of Keras/PyTorch:
In case we go with @rcurtin's second option, my two cents would be to make sure that the simplicity of this repo is not affected, so that starter-code access and download remain as simple as they are now. For the first option, of course, no change in structure would be required for this repo except renaming it.
Hey @zoq, I think it might be a better idea to simply rename this repo and create a new one. Since the new repo would have a proper structure it might be easier to add models and other tools there once a structure is decided, rather than moving files from here to new repo and then restructuring this one. I think most active PRs can easily be shifted there. What do you think? |
Agreed, it should be super simple for users to run the code, and we can do that if we keep the complexity low. I'm working on a simple notebook example, which I can hopefully show in the next few days.
If that's fine with you; I think you opened most of the current PRs. :)
Awesome, it sounds like we are pretty much agreed. If @kartikdutt18 is okay with porting all his PRs, then we can probably rename this repository. To @sreenikSS's point, it seems like maybe the best idea for the...

The other thing that seems good to implement is the use of git submodules in the main mlpack repository, so that it's not hard for users who found that repository, but not these two, to end up finding the extra models and examples. For the models repository specifically, git submodules might be really useful, as they could give a way to "piggyback" on the existing CMake configuration to build bindings for other languages, letting us easily produce Python bindings for those models, and so forth. I don't know the exact details of the best way to do that... But anyway, maybe the best way forward is to make the name change, separate these two repositories, and see what we learn along the way? 😄 Unless there are any objections, I'll do the renaming/moving sometime on Thursday.
That would be great, as most of my PRs are geared towards the second type of users, so that makes a lot of sense.
I am fine with it; I have closed my PRs in this repo. However, #56 is more related to the examples repo, so I think I'll keep it open. What do you think?
Agreed, that would be a lot better. It would make sense to remove Kaggle/Utils as most of the functions in it are implemented in mlpack.
Great, looking forward to it. Thanks a lot.
Awesome, thank you everyone for your input on this. 🎉 Now we have two repositories:
Both were created from the same codebase, so it should be easier to adapt them. I think we have a direction to go here. I can see some issues have already been opened to adapt this fully into an examples repository. If you see anything that should be done but hasn't been done yet, feel free to open an issue or a PR. :) So far I am working on getting a first draft of an updated README for both, and setting up the utilities like mlpack-bot, labels, CI, etc. I'll go ahead and close this issue, and open up another issue to collect the set of things to be done before we're finished transitioning this repository. I already opened one for the models repository: mlpack/models#1. 👍 Thanks everyone! 💯
I've opened this issue to follow up on the discussion that we had in the video chat last week (CC: @shrit, @kartikdutt18). This kind of comes out of some comments I made on #55 and in some other places, where I thought this repository was a place to collect example implementations that people could base their own applications off of. However, I don't think that's necessarily what we have to make this repository focused on, and it seemed like there was a diversity of opinions on how to structure this repository.
In essence, @shrit and @kartikdutt18 pointed out that there are people who might like to directly use the models in this repository off-the-shelf on their data. This would be why the use of the CLI framework makes sense here; however, a drawback is that it makes the actual code a little less easy to understand for users who just want a minimum working example that they can adapt.
Thus we have (at least) two kinds of users:
Folks who want to read the code here, understand it, and copy-paste it into their own applications. For them, this repository is kind of the equivalent of a collection of examples in the genre of the Hands-On Machine Learning notebooks (i.e. https://github.com/ageron/handson-ml/blob/master/01_the_machine_learning_landscape.ipynb) and the Keras examples directory (https://github.com/keras-team/keras/tree/master/examples). (There are lots of other repositories in this type of vein.)
Folks who don't want to read the code but directly use the model types that are available. Actually I think that this set of users might be closer to the original intention of the repository. For these users it would be awesome to have command-line programs that, e.g., train a model with a specific architecture, or download pretrained weights to make predictions, etc.
So, I opened this issue so that we can (a) work out how to best serve these types of users and (b) list any more types of users that we reasonably need to consider. :)
I'll also throw a proposal out there, and we can refine it and modify it.
We can handle the first class of users by either creating a separate "examples" repository that has extremely simple examples, or by adding an `examples/` directory to this repository. (Or even the main mlpack repository??) This could contain some simple workflow examples like the LSTM examples, and even examples for non-neural-network models, like the ones that are currently contained in the various tutorials that we have. Examples could be `.cpp` files, but also `.sh`/`.py`/`.jl` files that demonstrate usage of the mlpack bindings, for instance.

We can handle the second class of users by turning this repository into a collection of specific bindings, in the same style as `src/mlpack/methods/`. Each directory can contain a model type. We might need some additional extra code to support downloading models, or something else. Models could easily be hosted on mlpack.org, as we currently aren't anywhere close to our maximum bandwidth costs. (That may eventually become more of a problem, but we can handle it when we get there.) Each binding can use the CLI system to handle input and output parameters, and we can use the CMake configuration ideas from the main mlpack repository to allow building of Python and Julia bindings, not just command-line programs. That way we could, e.g., provide turnkey models in other languages. (How we deploy those models and make them available is a separate issue, but shouldn't be too hard.)

That's just an idea---I'm not necessarily married to it. If others have other ideas, please feel free to speak up! Honestly speaking, I don't really have the time to structure this repository in the way that we decide, or to maintain it thoroughly, so I don't want anyone to feel like I'm forcing an idea that I won't be around to see through. :)