
What is this repository for? #61

Closed
rcurtin opened this issue Mar 9, 2020 · 28 comments

@rcurtin
Member

rcurtin commented Mar 9, 2020

I've opened this issue to follow up on the discussion that we had in the video chat last week (CC: @shrit, @kartikdutt18). This kind of comes out of some comments I made on #55 and in some other places, where I thought this repository was a place to collect example implementations that people could base their own applications off of. However, I don't think that's necessarily what we have to make this repository focused on, and it seemed like there was a diversity of opinions on how to structure this repository.

In essence @shrit and @kartikdutt18 pointed out that there are people who might like to directly use the models in this repository off-the-shelf for their data. This would be why the use of the CLI framework would make sense here; however, a drawback is that that makes the actual code a little less easy to understand for users who just want a minimum working example that they can adapt.

Thus we have (at least) two kinds of users:

  • Folks who want to read the code here, understand it, and copy-paste it into their own applications. For them, this repository is kind of the equivalent of a collection of examples in the genre of the Hands-On Machine Learning notebooks (i.e. https://github.com/ageron/handson-ml/blob/master/01_the_machine_learning_landscape.ipynb) and the Keras examples directory (https://github.com/keras-team/keras/tree/master/examples). (There are lots of other repositories in this type of vein.)

  • Folks who don't want to read the code but directly use the model types that are available. Actually I think that this set of users might be closer to the original intention of the repository. For these users it would be awesome to have command-line programs that, e.g., train a model with a specific architecture, or download pretrained weights to make predictions, etc.

So, I opened this issue so that we can (a) work out how to best serve these types of users and (b) list any more types of users that we reasonably need to consider. :)

I'll also throw a proposal out there, and we can refine it and modify it.

  1. We can handle the first class of users by either creating a separate "examples" repository that has extremely simple examples, or, by adding an examples/ directory to this repository. (Or even the main mlpack repository??) This could contain some simple workflow examples like the LSTM examples, and even examples for non-neural network models, like the ones that are currently contained in the various tutorials that we have. Examples could be .cpp files, but also .sh/.py/.jl files that demonstrate usage of the mlpack bindings, for instance.

  2. We can handle the second class of users by turning this repository into a collection of specific bindings, in the same style as src/mlpack/methods/. Each directory can contain a model type. We might need some additional extra code to support downloading models, or something else. Models could easily be hosted on mlpack.org, as we currently aren't anywhere close to our maximum bandwidth costs. (That may eventually become more of a problem, but we can handle that when we get there.) Each binding can use the CLI system to handle input and output parameters, and we can use the CMake configuration ideas from the main mlpack repository to allow building of Python and Julia bindings, not just command-line programs. That way we could, e.g., provide turnkey models to other languages. (How we deploy those models and make them available is a separate issue, but shouldn't be too hard.)

That's just an idea---I'm not necessarily married to it. If others have other ideas, please feel free to speak up! Honestly speaking, I don't really have the time to structure this repository in the way that we decide or maintain it thoroughly, so I don't want anyone to feel like I'm forcing an idea that I won't be around to see through. :)

@kartikdutt18
Member

kartikdutt18 commented Mar 10, 2020

Hi @rcurtin, I think PR #60 addresses some of these issues, and hopefully the rest soon.
I hope this message doesn't get too long. Here are the changes I made:

New directory structure:

models/
  -- data (empty; everything is downloaded based on the user's CMake options)
  -- dataloaders (pass a dataset name like "mnist", or a path to your CSV)
  -- examples (a folder for those who want to learn mlpack)
     -- mnist_tutorial
        -- mnist_tutorial_cnn.cpp (using LeNet)
        -- mnist_tutorial_nn.cpp (using a simple NN)
     -- (similar structure for LSTMs)
  -- models (all models implemented as classes)
     -- lenet (all LeNet models: 1, 4, 5)
     -- simple_nn (an FFN model)
  -- utils
     -- utils.hpp (useful functions, such as GetLabels to get labels from predictions)
  -- weights (weights for each model; download only, empty in the repo)
  -- tests (tests that download, load, train, and perform inference using the models)

In-depth changes:

  1. data:
    Using CMake, the user specifies whether to download all datasets or a particular file, e.g.
    -DDOWNLOAD_MNIST=ON or -DDOWNLOAD_DATASETS=ON.

  2. weights:
    Using CMake, the user specifies whether to download the weights for one model or for all of them, e.g. -DDOWNLOAD_LENET_WEIGHTS=ON or -DDOWNLOAD_WEIGHTS=ON.

  3. dataloaders:
    Useful for anyone: if you pass a path, it loads the CSV, performs a train/test split, and applies augmentation. It needs to be overloaded to add support for folders.

  4. examples:
    Here we will have different folders; currently the mnist_tutorial folder is there. It shows how to use mlpack to create and use a model (it needs to be heavily documented), with a separate executable for each file.
    Users can run these directly to train and infer on the MNIST dataset, or learn from them.

  5. tests:
    For contributors only, a way to test models that are added and ensure that they work.

  6. I don't know the name of this folder yet, so let's call it fastmlpack for now.
    A file called object_detection.cpp where the user gives options like model_name (currently one of lenet1, lenet4, lenet5, alexnet, and hopefully vgg19), ratio, file_path, epochs, use_early_stopping, and some more. One can then get a trained model, and eventually a saved model that is ready for inference. For the CLI I see a few options (not sure which I would end up going with):

    1. Use argc/argv (it would work, though it's not really cool).
    2. Write a compact CLI just for models and datasets (shorter and easier to maintain, but much more difficult to implement; really cool).
    3. Wait for the mlpack CLI.

I would love to get your opinion. The PR isn't yet as refined as I think it should be, so I am constantly working on improving it; if you get a chance, kindly take a look. I know a PR with that many changes isn't easy to review, so I would really appreciate short notes/changes/suggestions.

Future work:

  1. Once a track for adding models is set, we should get more contributors to add more models.
  2. Nearly perfect documentation (as close as we can get).
  3. Add Bindings for this repo.

Relevant tickets:
PR: #60; original issues: #57, #40.

Well, I think we can officially say that this is a long message.
Also, for new contributors I suggest going through the conversation in #57 to get a bit more context.
I would really love any opinions on this, as it aims to completely restructure the models repo, and hopefully we only want to do that once.

@kartikdutt18
Member

Also, here is another comment that I made earlier in a discussion in #57:

Here are the changes that I think might be useful:

Models should be included as libraries/classes (a sample can be seen in PR #50), so a user can include models/alexnet.hpp and use AlexNet as the base network for something else.

A dataloaders folder (I will be pushing changes for it this week in PR #50). Here we should provide a wrapper around data::Load, the scalers, and data::Split. This is especially useful for known datasets such as MNIST, and will also include support for resizing, etc.

An examples folder (refer to #50), where a call to each class should be explained. E.g. an object detection sample with code for training and validation on a dataset, a similar file for object localization, and so on.

A testing folder (for contributors), where they add a test that loads weights, trains for 5 epochs, and sets a reasonable accuracy as a baseline. This will ensure the addition of valid models, as well as streamline adding them.

All data in one folder. I would, however, prefer that we make changes in CMake to download data rather than storing it in the repo, to keep the models repo lighter.

A weights folder inside the models folder, where the weights of each model are stored.

Why it might be useful:

Shorter and cleaner code for the user (load data using the data loaders, import a model, and train).

Better for addition of models.

Refined UI.

Helpful for new users to easily train models.

Full discussion can be found in #57

@favre49
Member

favre49 commented Mar 12, 2020

We can handle the second class of users by turning this repository into a collection of specific bindings, in the same style as src/mlpack/methods/. Each directory can contain a model type. We might need some additional extra code to support downloading models, or something else. Models could easily be hosted on mlpack.org, as we currently aren't anywhere close to our maximum bandwidth costs. (That may eventually become more of a problem, but we can handle that when we get there.) Each binding can use the CLI system to handle input and output parameters, and we can use the CMake configuration ideas from the main mlpack repository to allow building of Python and Julia bindings, not just command-line programs. That way we could, e.g., provide turnkey models to other languages. (How we deploy those models and make them available is a separate issue, but shouldn't be too hard.)

In my opinion, this is what this repository should be. We did have a short discussion about this in #57 .

We can handle the first class of users by either creating a separate "examples" repository that has extremely simple examples, or, by adding an examples/ directory to this repository. (Or even the main mlpack repository??) This could contain some simple workflow examples like the LSTM examples, and even examples for non-neural network models, like the ones that are currently contained in the various tutorials that we have. Examples could be .cpp files, but also .sh/.py/.jl files that demonstrate usage of the mlpack bindings, for instance.

A couple months ago, @zoq and I discussed the possibility of using Jupyter notebooks for smaller programs and tutorials. Do you think we could employ Jupyter notebooks to handle this class of users? I haven't looked far enough into it to create a concrete plan, but I think it's a good idea nonetheless.

@jeffin143
Member

The transfer learning approach (download the model with saved weights and just plug in your own dataset, perhaps with an option to change the input and output) could work well. I like the idea, @rcurtin.

@sreenikSS's work may come in handy here, since the converter can convert many models, so we may not have to write them explicitly.

@rcurtin
Member Author

rcurtin commented Mar 13, 2020

@kartikdutt18 thanks for the response. I have no issue with long responses, if you haven't guessed. 😄 I think the proposal makes sense and would be a nice improvement to the repository. Any comments I would have are pretty minor (like, e.g., would CMake be the best way for users to specify that weights should be downloaded? Or should it be some other utility or way?) and personally I think that those individual comments can be handled later. So I would vote for @kartikdutt18's proposal for the overall structure.

A couple months ago, @zoq and I discussed the possibility of using Jupyter notebooks for smaller programs and tutorials. Do you think we could employ Jupyter notebooks to handle this class of users? I haven't looked far enough into it to create a concrete plan, but I think it's a good idea nonetheless.

@favre49 yeah, definitely, Jupyter notebooks would be nice. We can actually do those pretty easily with Python bindings, but if we want native C++ notebooks, we would have to use xeus-cling, which I believe still suffers from the multiple definitions issue (jupyter-xeus/xeus-cling#91). I think that would be a really effective way to do examples though. (Bonus: they're notebooks, so it's easy to test whether or not they run and therefore test our documentation!)

@sreenikSS's work may come in handy here, since the converter can convert many models, so we may not have to write them explicitly.

Definitely, we might be able to take existing TensorFlow model weights and just convert them directly for mlpack and store them on our servers for easy download.

Overall I think we should leave this issue open for a little while longer to gather comments and opinions before deciding. :)

@kartikdutt18
Member

Great, I have a PR open for this, though it will be a bit tedious to review. Currently I have implemented downloads using CMake, but I think the best way would be to download at runtime if the option for using weights/datasets is set to true. I am currently working on a generic data loader to support various popular datasets, so that more datasets can be added without any hassle. Then I will start with tests for all models and the API (by tomorrow, I hope). I know it's a bit much to ask, but would it be possible to add style checks to this repo as well?

@sreenikSS's work may come in handy here, since the converter can convert many models.

Definitely, we might be able to take existing TensorFlow model weights and just convert them directly for mlpack and store them on our servers for easy download.

That would be great to add. Hopefully next week I'll add grouped convolutions to the convolution layers (I already have a PR open; it needs some debugging and rebasing), and then we could also load weights for convolutional layers, so we would end up with a whole new domain of models whose weights we could use.

@zoq
Member

zoq commented Mar 13, 2020

Thanks @rcurtin for opening the issue and thanks for splitting users into groups.

Personally, I would focus on "Folks who want to read the code here, understand it, and copy-paste it into their own applications.". So instead of trying to provide an executable that supports different datasets and tasks, have a single file that implements one concrete problem.

Like classification of the CIFAR-10 dataset using convolutional neural networks; another example could be collaborative filtering applied to the Netflix dataset. The discussion right now seems to be neural-network-centric, but mlpack implements other machine learning methods as well.

I think the most common user has a specific problem and just searches for a solution; such a user will just copy the code and integrate it into their pipeline, so they don't necessarily need a method to load a dataset (not saying we don't need that, because it would make things easier) or an interface to set up the method. I guess a simple main method with all the necessary steps would probably do the job. The idea is really similar to what notebooks have become.

This is just my opinion, and I don't want anyone to feel like I'm forcing an idea here; in the end, this should be a community decision, and we can always open another repository that focuses on the other group.

@jeffin143
Member

jeffin143 commented Mar 13, 2020

@zoq definitely, we have drifted towards ANNs a little bit, and I agree we should have an examples section that would handle the first group of users.

Now we have to decide whether we should have it in this repo or a separate repo; personally, I would not mix the two.

@kartikdutt18
Member

Hey @zoq, @jeffin143, I think that something like this tutorial would make sense. It would have only a main function, and we would import the model and data from the rest of the repo.
For machine learning methods, I think simple tutorials in a folder should suffice; I think they are well documented on mlpack.org.
We also have documentation there for a sample FFN and how to use it on a dataset.

I think there are the following ways to handle this:

  1. We can have a models repo where all models are made available as classes so that they can be included in other files, and a models-tutorials repo, where we could import these models and have tutorial-like code as shown on the site above. This should help people get started.

  2. We can have an examples folder here, which would be as well documented as above: each folder with its own README, explaining each line and how to swap out the code we wrote for their problem.

The examples won't contain a CLI or anything special; they would simply boil down to:
a) loading data,
b) importing a model,
c) training (or fitting) and inference.

For the second type of users, who are familiar with mlpack and ANNs (in general) and want ready-to-train/infer models for their data, I think having a folder to solve their problem shouldn't be very hard. This code would be similar to that in the tutorial, with the difference of a CLI and generalized code.
This mirrors the TensorFlow models repo: they have a tutorials folder with tutorials for different types of datasets, and a separate folder called research, which contains the object detection API, which I think is the API most commonly used by students in computer vision. If possible, we can later add website tutorials or notebooks for those just getting started.
I would love to hear your opinion on this.

Thanks.

@sriramsk1999

sriramsk1999 commented Mar 14, 2020

Just throwing in my $0.02 here.

I agree with @zoq that "Folks who want to read the code here, understand it, and copy-paste it into their own applications." would be the more common group of users and the ones to focus on. In my opinion, an examples folder in the main mlpack repo would be the best way to serve this group.

It makes sense to me to move the "one-shot" tutorial files currently here to an examples folder, acting like a quick onboarding for new users on how to use mlpack. And that folder would be better placed on the main repo for easy access.

Then this repo would be a collection of popular NN models to use in custom applications.

We can have a models repo where all models are made available as classes so that they can be included in other files, and a models-tutorials repo, where we could import these models and have tutorial-like code as shown on the site above.

@kartikdutt18 put it much better than me and I think this should be what the models repo is for.

@zoq
Member

zoq commented Mar 14, 2020

Now we have to decide whether we should have it in this repo or a separate repo; personally, I would not mix the two.

Agreed, I don't think it makes sense to mix them; ideally, we keep the repository as simple as possible. We learned a lot when we split the optimization framework into its own repository (mlpack/ensmallen): from a user perspective, you don't have to search through folders that you don't need or aren't interested in.

Hey @zoq, @jeffin143, I think that something like this tutorial would make sense. It would have only a main function, and we would import the model and data from the rest of the repo.
For machine learning methods, I think simple tutorials in a folder should suffice; I think they are well documented on mlpack.org.
We also have documentation there for a sample FFN and how to use it on a dataset.

I don't necessarily have tutorials in mind; writing a good tutorial takes time, and often you don't really want to read a tutorial, you just want the code. What I had in mind is some kind of sink where I can find a bunch of "real-world" use cases. If they are documented well enough to be seen as tutorials, that is a plus.

I think there are following ways to handle this:

1. We can have a models repo where all models are made available as classes so that they can be included in other files, and a models-tutorials repo, where we could import these models and have tutorial-like code as shown on the site above. This should help people get started.

Sounds like a good plan to me. I wouldn't name it tutorials (see the comment above), but that's just a minor detail. Also, we should make sure this isn't going to be too complex: https://github.com/mlpack/models/blob/master/Kaggle/DigitRecognizerCNN/src/DigitRecognizerCNN.cpp is already pretty much what I have in mind. I don't necessarily want to link against the models repo just to include a certain model if I could have that directly in my single-file example.

It makes sense to me to move the "one-shot" tutorial files currently here to an examples folder, acting like a quick onboarding for new users on how to use mlpack. And that folder would be better placed on the main repo for easy access.

The problem I see with that approach is dependencies: for some examples it would be nice to have OpenCV or something else that isn't needed to build/use mlpack, but is needed for the example.

@sriramsk1999

The problem I see with that approach is dependencies: for some examples it would be nice to have OpenCV or something else that isn't needed to build/use mlpack, but is needed for the example.

Hmm, I hadn't considered that. However, I'm still of the opinion that the examples would be better placed in the main repo (perhaps with each example listing its dependencies?), given that it grants them so much more visibility than an auxiliary repo.

@kartikdutt18
Member

kartikdutt18 commented Mar 14, 2020

I don't necessarily have tutorials in mind; writing a good tutorial takes time, and often you don't really want to read a tutorial, you just want the code. What I had in mind is some kind of sink where I can find a bunch of "real-world" use cases. If they are documented well enough to be seen as tutorials, that is a plus.

Ahh, I think I finally understand. So the goal for that repo would be to have code for some trivial problems (in a tutorial-like manner), so a user could understand how to use mlpack's different methods (layers / machine learning algorithms) on some problems, to help them get started.

I don't necessarily want to link against the models repo just to include a certain model, if I could have that directly in my single file example.

Got it. I guess I finally understand what the tutorial repo (of course, a better name would be chosen) should look like. I guess having a tutorial README would be a good idea; it might take time, but I think that'll be really great.

The problem I see with that approach is dependencies: for some examples it would be nice to have OpenCV or something else that isn't needed to build/use mlpack, but is needed for the example.

Agreed, we might use some tools there that we don't need in mlpack; I think this would also allow us to add a few more dependencies for certain portions of the code.

So I have a couple of questions about my restructuring PR:

  1. Should I remove the examples folder?
  2. How should we go about shifting the examples to a new repo?
  3. The aim of this repo would then be to hold state-of-the-art models for various problems that can be readily applied to most datasets the user passes, hopefully using a CLI, plus support for some standard datasets for benchmarking, as well as some tests to check the correctness of models. (I know there can't be a straight answer to this, but I would love to improve on it with your thoughts.)

Thanks a lot for this discussion, really learned a lot.

@favre49
Member

favre49 commented Mar 14, 2020

Agreed, I don't think it makes sense to mix them; ideally, we keep the repository as simple as possible. We learned a lot when we split the optimization framework into its own repository (mlpack/ensmallen): from a user perspective, you don't have to search through folders that you don't need or aren't interested in.

The problem I have with this vs. having an examples folder in this repo is:

  1. Does something this simple warrant its own repo? The way I envision it, it would be a repository that essentially looks like a single folder of example.cpp files.
  2. Visibility of this repo to the public. But I think if we linked to it from the website or something, it could work.
  3. I'm confused about the difference between an example that employs a model defined in this folder (say, an example that shows how we could use YOLO for object detection) vs. MNIST or k-means examples. Why does one belong here but the other in a repo by itself? The demarcation feels confusing to me.

@jeffin143
Member

I'm confused about the difference between an example that employs a model defined in this folder (say, an example that shows how we could use YOLO for object detection) vs MNIST or k-means examples. Why does one belong here but the other in a repo by itself? The demarcation feels confusing to me.

@favre49, when I first came across the models repo in mlpack, I thought it held saved models that I could preload and work on. But then I found that it instead has some examples of how mlpack works and the functionality it provides, like CNN code that simply creates a CNN architecture.

So my whole understanding of the models repo now is something like the code below, where the models would be predefined and you can just import one, create an instance, and work on it:

from tensorflow.keras.applications import MobileNet
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Import the MobileNet model, discarding the final 1000-neuron classification layer.
base_model = MobileNet(weights='imagenet', include_top=False)

x = base_model.output
x = GlobalAveragePooling2D()(x)
preds = Dense(4, activation='softmax')(x)

model = Model(inputs=base_model.input, outputs=preds)

tb_callback = TensorBoard(log_dir='/tmp/keras_logs', write_graph=True)
model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=15, shuffle=True, verbose=1,
                    validation_split=0.15, callbacks=[tb_callback])

And the examples folder would be more about how you can create layers, plug in your own code, manipulate the hyperparameters, and solve some basic classification problem as an example. This would not use the import-model feature; it is bare minimum and does not depend on anything, just like the present approach:

from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import TensorBoard

model = models.Sequential()
model.add(layers.Dense(1, input_shape=(32, 32, 1)))
model.add(layers.Flatten())
model.add(layers.Dense(100 * 100 * 3))
model.add(layers.Reshape((100, 100, 3)))
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(9, activation='softmax'))

model.summary()
tb_callback = TensorBoard(log_dir='/tmp/keras_logs', write_graph=True)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=50, shuffle=True, verbose=1,
                    validation_split=0.3, callbacks=[tb_callback])

In the first example, you don't have to create the MobileNet architecture since it is already prebuilt; but if you want to create one, you can refer to the examples folder to learn how, and what you can use to build it.

Visibility of this repo to the public. But I think if we linked to it from the website or something, it could work.

Exactly. We can use a static website builder and create a website or page with that documentation or those examples. The CNN architecture is currently just in the models repo, but we could use a static builder to create a website, some kind of blog, or just code snippets.

The whole point being that the weights have some size: I have seen big architectures whose weights are close to 100 MB, and if we are going for 7 or 8 of those, then someone who just wants to see the examples and not use all these models still has to clone a potentially big repo. It's just like the main mlpack repo: if someone wants to build only the Python bindings, they still have to clone the whole repo.

And support for some standard datasets for benchmarking

We have a repo for that; I guess this would fit there, and then maybe we could write a blog post about the speed optimization and other things and link it to the repo.

hopefully using a CLI

We would then have to adapt the strategy used in mlpack to generate bindings; @zoq @rcurtin, I would like your take on this.

Should I remove the examples folder?

Before that, we should port it to the other repo, if everyone votes for it; otherwise you shouldn't.

I guess we should definitely come to a conclusion, since there are a lot of plans in hand, and coming up with or finalizing one is difficult.

@kartikdutt18
Member

kartikdutt18 commented Mar 14, 2020

Hmm, I think by benchmarking I meant that the user would pass their dataset (or datasets that we support) and a set of models, and we would return the best possible model for a fixed (tunable) number of epochs and hyperparameters. And standard dataset support would let me just pass, say, "mnist" and have trainX, trainY, etc. initialized so that they can be passed directly into model.Train() without the user changing anything.

Also, weights and datasets will be downloaded only as needed. Currently I keep the datasets in a separate repo, from which I download them in my PR.

@shrit
Member

shrit commented Mar 16, 2020

There are a lot of great propositions here.
I thought of one which is a bit different.

I agree with @rcurtin that there are a lot of folks who want to read the code rapidly, understand it, and copy-paste it directly into their own code.
For this reason, @kartikdutt18 has proposed creating an examples folder that holds these examples, which is great. I have a few ideas to add here:

  1. Instead of mixing models and examples, I think we should rename this whole repository to examples instead of models, to remove all ambiguity related to the word "models" (since it can be understood as trained models, weights, specific neural network architectures, etc.).
  2. We can address the other type of users in a new repository: this can be done by moving all the mlpack_* executables from mlpack into the new models repository. However, we do not have to change mlpack's actual CMake system; we can add the new models repository as a git submodule, which will be downloaded automatically if a user would like to compile a specific executable by typing make mlpack_knn.
  3. Add the new examples repository as a git submodule too; this will improve the visibility of the repository for new users. Each example can come with a small tutorial file that does the documentation part. In my own work, I use GitBook by default to document every example, explaining every piece of code for new users and providing a link to the entire example at the end; the GitBook part is written as a markdown file in Jupyter-notebook style. This allows users to understand it directly, and even copy-paste the interesting lines.
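For reference, the submodule idea could amount to an entry like the following in the main repository's .gitmodules file; the path and URL here are illustrative only.

```ini
[submodule "examples"]
	path = examples
	url = https://github.com/mlpack/examples
```

A user who wants the examples would then run `git submodule update --init examples` after cloning mlpack; everyone else can simply ignore the empty directory.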

@rcurtin
Member Author

rcurtin commented Mar 16, 2020

Awesome, great to see so many responses. 😄 I think we all agree that we have two different kinds of users, and it seems to me like it would be worthwhile to treat those users separately; everyone seems to agree (mostly?) that there should be two different repositories for these purposes.

There will be some details to work out for sure, too. Luckily we have tons of suggestions. 🎉 @shrit's suggestion for git submodules could be really nice, and there are other nice detail suggestions too all over the thread. I think if we focus on the highest-level question first, then all the implementation details will fall into place easily. :)

So, to that end, it seems like there's really only one question:

Do we choose to keep just barebones simple examples in this repository and rename it to examples? Or do we put all these very simple examples in the normal mlpack repository under an examples/ directory? (Personally, I'd vote for the first option. Like @shrit pointed out we could use a git submodule for examples/ in the main mlpack repository.)

  • First option (barebones simple examples, rename this repo to examples): then it would still be useful to provide prebuilt and pretrained models in another repository, so we can make another repository. (Also possible to use git submodule there too!)

  • Second option (simple examples go to examples/ in the mlpack repository): then I suppose we can make this repository just for prebuilt/pretrained models.

What do we think? If we can get agreement on that question then working out the details should be a bit easier. I think I captured the range of opinions in just those two options, let me know if I missed one. 😄

@kartikdutt18
Member

Hey @rcurtin, I think the first option might make more sense, as @zoq pointed out we might add extra dependencies to the examples which are not needed to build mlpack. So I think it makes sense to have two repos, one for examples and another for pretrained / prebuilt models.

@favre49
Member

favre49 commented Mar 17, 2020

I really like the git submodules idea, I think we should go for that.

@sriramsk1999

I think the first option is better as well. Its only con (?) would be an increasing number of repos, but that is much better than the dependencies problem of the second option.

@zoq
Member

zoq commented Mar 17, 2020

I agree with the general idea, but would it make more sense to create a new repo for the examples and move the code we already have over there, instead of renaming this one and creating a new repo? I think most of the current PRs address the second type of user?

@sreenikSS
Contributor

I'm a little late. Some really great ideas here. Since we are en route to a final decision, I think rounding up the details wrt the perspective of a new user would be a good idea. That being said, let's assume I am a new user of mlpack but I have knowledge about keras/pytorch:

  • I decide to create an mnist model
  • I find this models repo (let's call it models repo for now) and go to the mnist.cpp file
  • I copy-paste it into my code editor but on second thought decide to clone this repo for my convenience and for directly running it
  • <If it takes a lot of time to download and clone, the user may be irritated>
  • I open the folder and quickly configure my IDE to compile with the flags I generally use for compiling other mlpack programs
  • <If additional compilation flags are required for running it, the user may become more confused>
  • <the user now has two choices, 1. to run the mnist.cpp itself (which we would consider now) and 2. to copy relevant code to his own program>
  • I compile and run the example, see the output, become happy and write "yippee I built my first model with mlpack" on the IRC.

In case we go with @rcurtin's second option, my two cents would be to make sure that the simplicity of this repo is not affected so that starter code access and download remains as simple as it is now. For the first option of course no change in structure would be required for this repo except just renaming it.

@kartikdutt18
Member

Hey @zoq, I think it might be a better idea to simply rename this repo and create a new one. Since the new repo would have a proper structure, it might be easier to add models and other tools there once a structure is decided, rather than moving files from here to the new repo and then restructuring this one. I think most active PRs can easily be shifted there. What do you think?

@zoq
Member

zoq commented Mar 17, 2020

In case we go with @rcurtin's second option, my two cents would be to make sure that the simplicity of this repo is not affected so that starter code access and download remains as simple as it is now. For the first option of course no change in structure would be required for this repo except just renaming it.

Agreed, it should be super simple for users to run the code, and we can do that if we keep the complexity low. I'm working on a simple notebook example that I can hopefully show in the next few days.

Hey @zoq, I think it might be a better idea to simply rename this repo and create a new one. Since the new repo would have a proper structure it might be easier to add models and other tools there once a structure is decided, rather than moving files from here to new repo and then restructuring this one. I think most active PRs can easily be shifted there. What do you think?

If that's fine with you---I think you opened most of the current PRs :)

@rcurtin
Member Author

rcurtin commented Mar 18, 2020

Awesome, it sounds like we are pretty much agreed. If @kartikdutt18 is okay porting all his PRs then we can probably rename this repository to examples and make a new one called models. 😄 I'll let this sit for a couple of days before actually making the rename, just in case there are more comments.

To @sreenikSS's point, it seems like maybe the best idea for the examples repository is to keep things as simple as possible---perhaps even to the point where we have no CMake configuration. Maybe that mnist.cpp example is in a directory, mnist/, which has a Makefile and a README. The Makefile could have some commented out lines where you can specify include directories, etc., in case you haven't installed mlpack, and the README could contain some very basic pointers on how to build the program. (i.e., "make sure mlpack is installed, then make and run. You can change parameters in the first handful of lines of the programs to modify the dataset it runs on, etc.") I'm sure we can work out further details moving forward, but I think it will be pretty easy.
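A hypothetical mnist/Makefile along these lines might look like the sketch below. All paths, flags, and library names here are illustrative assumptions, not a settled design; the commented-out lines are the ones a user would uncomment if mlpack is not installed system-wide.

```make
# Hypothetical Makefile for an mnist/ example directory (illustrative only).
TARGET := mnist
SRC    := mnist.cpp

CXX      ?= g++
CXXFLAGS += -std=c++11 -O2
LDLIBS   += -lmlpack -larmadillo

# If mlpack is not installed system-wide, uncomment and adjust these paths:
# CXXFLAGS += -I/path/to/mlpack/include
# LDFLAGS  += -L/path/to/mlpack/lib

$(TARGET): $(SRC)
	$(CXX) $(CXXFLAGS) $(LDFLAGS) $< -o $@ $(LDLIBS)

clean:
	rm -f $(TARGET)

.PHONY: clean
```

With mlpack installed, a user would just run `make` in the example directory and then `./mnist`.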

The other thing that seems good to implement is the use of git submodules in the main mlpack repository, so that users who find that repository but not these two can still end up finding the extra models and examples. For the models repository specifically, git submodules might be really useful: they could give us a way to "piggyback" on the existing CMake configuration to build bindings for other languages, letting us easily produce Python bindings for those models, and so forth. I don't know the exact details of the best way to do that yet...

But anyway, maybe the best way forward is to make that name change, separate these two repositories, and see what we learn along the way? 😄 Unless there are any objections, I'll do the renaming/moving sometime on Thursday.

@kartikdutt18
Member

kartikdutt18 commented Mar 18, 2020

If that's fine with you, I think you opened most of the current PR's :)

That would be great, as most of my PRs are aimed more at the second type of user, so that makes a lot of sense.

If @kartikdutt18 is okay porting all his PRs then we can probably rename this repository to examples and make a new one called models.

I am fine with it; I have closed my PRs from this repo. However, #56 is more related to the examples repo, so I think I'll keep it open. What do you think?

Maybe that mnist.cpp example is in a directory, mnist/, which has a Makefile and a README. The Makefile could have some commented out lines where you can specify include directories, etc., in case you haven't installed mlpack, and the README could contain some very basic pointers on how to build the program. (i.e., "make sure mlpack is installed, then make and run. You can change parameters in the first handful of lines of the programs to modify the dataset it runs on, etc.")

Agreed, that would be a lot better. It would make sense to remove Kaggle/Utils as most of the functions in it are implemented in mlpack.

I'll do the renaming/moving sometime on Thursday.

Great, looking forward to it. Thanks a lot.

@rcurtin
Member Author

rcurtin commented Mar 20, 2020

Awesome, thank you everyone for your input on this. 🎉 Now we have two repositories:

  • mlpack/examples: barebones, easy-to-read examples to adapt and copy-paste from.
  • mlpack/models: prebuilt and pretrained models to use off the shelf.

Both were created from the same codebase, so it should be easier to adapt them.

I think we have a direction to go here. I can see some issues have already been opened to adapt this fully into an examples repository. If you see anything that should be done but hasn't been done yet feel free to open an issue or a PR. :) So far I am working on getting a first draft of an updated README for both, and setting up the utilities like mlpack-bot, labels, CI, etc.

I'll go ahead and close this issue, and open up another issue to collect the set of things to be done before we're finished transitioning this repository. I already opened one for the models repository: mlpack/models#1. 👍

Thanks everyone! 💯
