
Restructuring code for models repo. #60

Closed · wants to merge 5 commits

Conversation


@kartikdutt18 kartikdutt18 commented Mar 9, 2020

Hi everyone, here is the list of current changes:

  1. Implemented models (DigitRecognizer and LeNet) as classes that can be included.
  2. Added a DataLoader (currently supports only MNIST).
  3. An examples folder for MNIST called mnist_tutorial.
  4. The MNIST dataset can be downloaded using -DDOWNLOAD_MNIST or -DDOWNLOAD_DATASETS.
  5. All examples use ensmallen.
  6. Unified utils folder (currently the same as Kaggle_utils; needs cleaning up before I add LSTM).

Also closes #40; right now it partially handles #61 and aims to resolve #57.
I hope it would be okay to add AlexNet here as well.
I'll also add support for augmentation, better utils, weights, and tests in the next one or two days. I would really appreciate your feedback.

Future issues:
  • Add augmentation.
  • A better DataLoader (support for selecting start and end columns).

Hi @zoq, @favre49, if you get a chance, could you please review this? I'll be making more changes to download weights and tests soon.


kartikdutt18 commented Mar 9, 2020

I have also tested LeNet4, LeNet1, and SimpleNN for 3 epochs each. They worked fine on my machine, achieving 89%, 94%, and 92% accuracy respectively on the validation dataset.

Commits:
  • Some minor fixes
  • LeNet switch
  • Add support for downloading datasets
  • Transferred DigitRecognizer completely
@prince776

Hi @kartikdutt18, I think the implementation is really good. I checked the LeNet code and have one suggestion: add an includeTop option. I'm unsure whether it's needed in LeNet, but it is needed in models like VGG, so it would make the model architectures consistent.
Let me know what you think about it.
I'm also looking forward to adding Inception to it.
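
For illustration, here is a minimal sketch of what an includeTop flag could look like, in the spirit of Keras applications. This is hypothetical code assuming the mlpack 3.x ANN API; the class name, layer sizes, and 28x28 input are made up for the example and are not the code in this PR:

```cpp
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>
#include <mlpack/methods/ann/loss_functions/negative_log_likelihood.hpp>

using namespace mlpack::ann;

class LeNetSketch
{
 public:
  LeNetSketch(const size_t numClasses, const bool includeTop = true)
  {
    // Convolutional feature extractor, shared by both variants.
    // 28x28 input -> 24x24 after the 5x5 convolution -> 12x12 after pooling.
    model.Add<Convolution<>>(1, 6, 5, 5, 1, 1, 0, 0, 28, 28);
    model.Add<LeakyReLU<>>();
    model.Add<MaxPooling<>>(2, 2, 2, 2);

    // The fully connected head is appended only when includeTop is true;
    // with includeTop = false the network can serve as a base for a
    // larger architecture.
    if (includeTop)
    {
      model.Add<Linear<>>(6 * 12 * 12, numClasses);
      model.Add<LogSoftMax<>>();
    }
  }

  FFN<NegativeLogLikelihood<>>& Model() { return model; }

 private:
  FFN<NegativeLogLikelihood<>> model;
};
```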

@kartikdutt18 (Member Author)

Hmm, I think the point of the LeNet paper was the fully connected layers after the convolutions, so if I remove them it doesn't really remain LeNet anymore. However, I don't mind adding that option to be more consistent with the rest of the code base.

@kartikdutt18 (Member Author)

Also @prince776, I probably skipped over some comments, so if you notice something, could you review it and leave a comment saying we should add this line here, or something along those lines? Thanks.

@prince776

> Also @prince776, I probably skipped over some comments, so if you notice something, could you review it and leave a comment saying we should add this line here, or something along those lines? Thanks.

Ok, I'll do that.

@prince776 prince776 left a comment

Minor comment errors are mentioned, plus a minor suggestion. Thanks.

Review comments on:
  • dataloaders/dataloader.hpp
  • examples/mnist_tutorial/CMakeLists.txt
  • examples/mnist_tutorial/mnist_tutorial_cnn.cpp
  • models/lenet/lenet.hpp
  • models/simple_nn/simple_nn.hpp
  • utils/CMakeLists.txt
@kartikdutt18 (Member Author)

Hi @prince776, thanks for the review. I made constant changes to the code without updating the comments, so this is what we ended up with. I'll fix them in the next commit; I hope you can take a look then too. Thanks.

@ojhalakshya

Hi @kartikdutt18,
Could you add more details about these classes?
Thanks.

@prince776

> Hi @prince776, thanks for the review. I made constant changes to the code without updating the comments, so this is what we ended up with. I'll fix them in the next commit; I hope you can take a look then too. Thanks.

Yes, sure. I'm really excited to work on it anyway. I'll keep reviewing to make sure I'm up to date with the codebase.

@kartikdutt18 (Member Author)

Great. I don't think I'll be able to push a commit today; I am going with a complete redesign of the DataLoader, so I'll have to make changes accordingly. The next commit will have restricted LSTMs, CSV files removed completely from the repo, and a better DataLoader.

@kartikdutt18 (Member Author)

> Hi @kartikdutt18,
> Could you add more details about these classes?
> Thanks.

Will do. Some portions of the code need to be heavily documented. Thanks.

@kartikdutt18 (Member Author)

I haven't tested these changes; they probably require some debugging. I spent yesterday making a DataLoader that I ended up deleting, so I made some changes just to make progress. I'll post the results tomorrow after the fixes. Hopefully I'll also transfer the VAE and add tests tomorrow.


kartikdutt18 commented Mar 12, 2020

Adding a todo list here to keep track of changes:

  • Models should be included as libraries / classes (a sample can be seen in PR Adding AlexNet #50), so a user can include models/alexnet.hpp and use AlexNet as a base network for something else.

  • A DataLoaders folder (needs to be heavily improved, though).

  • A unified samples folder.

  • A testing folder (this will be for contributors) where they will add tests that load weights, train for 5 epochs, and set a reasonable accuracy as a baseline. This will ensure the addition of valid models as well as streamline the process of adding models.

  • All data in one folder. I would, however, prefer that we made changes in CMake to download data rather than storing it in the repo, to keep the models repo lighter.

  • A weights folder inside the models folder, where the weights of each model are stored.

Commits:
  • New DataLoader
  • Transferred LSTMs
  • Spacing issue fixed
  • Add empty line

kartikdutt18 commented Mar 12, 2020

Hey @zoq, could you please review the DataLoader and the overall layout? I wanted to know what the appropriate way of implementing a DataLoader would be so that more datasets can be added without hassle.

@jeffin143 jeffin143 left a comment

This is more a set of questions to understand the implementation details.

Comment on lines +122 to +126
void MNISTDataLoader();

// Google Stock Prices Dataloader.
void GoogleStockPricesDataloader();

Member

I am not sure why there are 3 different loaders; can't we make one generic one? Sorry, I might have missed something.

Member Author

We need at least two data loaders: one for time series and another for datasets like CSVs. I am working on implementing a better data loader, as mentioned above; if you have any suggestions, that would be great. Currently I'm thinking of implementing LoadTrainCSV and LoadTestCSV, which will be overloaded to support all CSV-related datasets.

Member

Sorry, but can you briefly explain to me why we need at least two data loaders?

Member Author

A time-series dataset needs a parameter rho (look-back), and the data has to be modified to hold past values, so we would have to store the data in cubes, whereas for other CSV datasets matrices are sufficient. So what I was planning to implement was an overloaded LoadTrainCSV that can take rho if needed and store the data in cubes; there would also be no shuffling of the data there. From it we could call the other LoadTrainCSV to perform the tasks that are common. I might also end up with one data loader, but I think I'll know for sure only after I start making the changes.
I'm sorry, I know this isn't really coherent, but I think after I code it, it will make much better sense. I'll try to reduce redundant code as soon as possible.
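
To make the cube point concrete, here is a minimal sketch of a rho-step look-back, following the pattern used in mlpack's LSTM examples; the helper name and the assumption that samples are stored one per column are illustrative only:

```cpp
#include <mlpack/core.hpp>

// Hypothetical helper: turn a (features x timesteps) matrix into
// (features x samples x rho) cubes, where the target is the input
// window shifted one step forward.
void CreateTimeSeriesData(const arma::mat& dataset,
                          arma::cube& X,
                          arma::cube& y,
                          const size_t rho)
{
  X.set_size(dataset.n_rows, dataset.n_cols - rho, rho);
  y.set_size(dataset.n_rows, dataset.n_cols - rho, rho);

  for (size_t i = 0; i < dataset.n_cols - rho; ++i)
  {
    X.subcube(arma::span(), arma::span(i), arma::span()) =
        dataset.cols(i, i + rho - 1);
    y.subcube(arma::span(), arma::span(i), arma::span()) =
        dataset.cols(i + 1, i + rho);
  }
}
```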

Member

No issues with it :). I am totally on board; I was just narrowing down the implementation.

We can call this v1, and I hope v2 and v3 will be much better 😉

Member Author

I will try my best and hopefully v2 will be much better than what it currently is. 👍

Comment on lines 94 to 95
trainX = train.submat(1, 0, train.n_rows - 1, train.n_cols - 1) / 255.0;
validX = valid.submat(1, 0, valid.n_rows - 1, valid.n_cols - 1) / 255.0;
Member

You can use the mlpack internal scaler here, I guess.

Member Author

Hmm, then we would also need to store the scaler, and this would have to be done for all the other datasets, which would just reduce flexibility. For MNIST we can use this. For the other data loaders I have getters / setters to scale the data, so the user can call them to fit and transform the data.
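
For reference, a minimal sketch of how mlpack's MinMaxScaler would be used here; the matrix names and sizes are placeholders:

```cpp
#include <mlpack/core.hpp>
#include <mlpack/core/data/scaler_methods/min_max_scaler.hpp>

arma::mat trainX(784, 1000, arma::fill::randu);
arma::mat validX(784, 200, arma::fill::randu);

// Fit on the training features only, then apply the same transform to
// both sets; the fitted scaler must be kept around if InverseTransform()
// is needed later.
mlpack::data::MinMaxScaler scaler;
scaler.Fit(trainX);

arma::mat trainXScaled, validXScaled;
scaler.Transform(trainX, trainXScaled);
scaler.Transform(validX, validXScaled);
```

The need to keep the fitted scaler object around is exactly the storage cost mentioned above.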

Member

I am not sure that would reduce flexibility; I'd say it gives the user more features, since he can opt in to scaling.

Member Author

Hmm, I will keep that in mind while making a better data loader. The current data loader allowed me to run some tests, since there were a lot of changes. I'll push a commit with a better data loader and you can take a look at that.

@jeffin143 jeffin143 Mar 13, 2020

If you don't mind, can I work on your branch?

PS: I just understood what our actual plan is.

Member Author

I think for the plan you can refer to this comment. I think this would be a very important PR for all future models that get added, as we need to clearly lay down how a model and a dataset are to be added, which also ends up streamlining the flow. Much like mlpack, where we have a clear set of rules that we need to follow to add a feature, hopefully through this PR we can achieve the same for this repo. Thanks.

Member

> Sure, just one thing I need to mention: after my last commit, I get errors of either a segfault or a boost failed assert. I would probably need to fix that first; then I could work on the data loader, and I would love your help if you get a chance. Thanks.

I really liked the idea; it was on my todo list, something about transfer learning and the whole idea. I will take a look over the coming weekend.

Member Author

I agree, I think this would be very useful, especially for training large models. We could train without the top (adding it outside the model class) and store the weights. Then we could use those weights in a larger model like YOLO. This should decrease the number of training epochs required and hence reduce training time.
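
A minimal sketch of that flow using mlpack's serialization routines; the file name and network type are placeholders, not an API from this PR:

```cpp
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/loss_functions/negative_log_likelihood.hpp>

using namespace mlpack;

int main()
{
  // Train the topless base network, then persist it.
  ann::FFN<ann::NegativeLogLikelihood<>> base;
  // ... Add() layers and Train() the base here ...
  data::Save("base_weights.bin", "model", base, false);

  // Later: load the pretrained base and reuse its weights inside a
  // larger model, so fewer epochs are needed than training from scratch.
  ann::FFN<ann::NegativeLogLikelihood<>> pretrained;
  data::Load("base_weights.bin", "model", pretrained, false);
  return 0;
}
```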

Member Author

Something similar to ladder training.

Member Author

Yay! So I fixed the build; testing on different OSes now.

DataSetY &ValidY() { return validY; }

//! Get the Training Dataset.
arma::mat TrainCSVData() const { return trainCSVData; }
Member

How is it different from the following, on line 62?

//! Modify the Training Dataset.
DataSetX &TrainX() { return trainX; }

Member Author

Ok, so if the user decides to store augmented data without splitting it into X and Y, he can access it using this. TrainX and TrainY are useful for classification only.

Member

Why can't we return them using join_rows(), joining TrainX and TrainY? Why a special member?

Member Author

I am not sure it would work for cubes; I can try it, though. But yes, I think the data loader needs to be improved a lot; it is only in a working state for now.
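
For reference, a quick Armadillo sketch of why the join works for matrices but not directly for cubes; the dimensions are made up:

```cpp
#include <armadillo>

int main()
{
  // Matrices: with one column per sample, the label row can be stacked
  // under the features with join_cols() (join_rows() appends columns).
  arma::mat trainX(10, 100, arma::fill::randu);
  arma::mat trainY(1, 100, arma::fill::randu);
  arma::mat joined = arma::join_cols(trainX, trainY);  // 11 x 100

  // Cubes: there is no join_rows()/join_cols() for cubes; only
  // join_slices() exists, which appends along the third dimension.
  arma::cube a(10, 100, 5, arma::fill::randu);
  arma::cube b(10, 100, 5, arma::fill::randu);
  arma::cube c = arma::join_slices(a, b);  // 10 x 100 x 10
  return 0;
}
```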

Member

I will take a look

@kartikdutt18 (Member Author)

Hey @jeffin143, any suggestions for CLI?

@prince776 prince776 left a comment

Looks mostly good to me; the minor required changes are mentioned below.

dataset.n_cols - 1));
}

#endif

An extra line is needed at the end of the file.

arma::cube preds;
model.Predict(dataloader.TestX(), preds);
cout << MSE(preds, dataloader.TestY()) << endl;
}

An extra line is needed at the end of the file.

* @param datasetPath Path or name of dataset.
* @param shuffle whether or not to shuffle the data.
* @param ratio Ratio for train-test split.
* @param dropHeader Drops first row if true.

This parameter is not present in the constructor.


//! Locally stored augmented probability.
double augmentationProbability;
};

A comment like // class DataLoader should be added here, after the };.

typename DataSetX = arma::mat,
typename DataSetY = arma::mat
>
class DataLoader

I think we should add all these classes within a namespace, like mlpack::dataloaders.

arma::cube preds;
model.Predict(dataloader.TestX(), preds);
cout << MSE(preds, dataloader.TestY()) << endl;
}

An extra line is needed at the end of the file.

@kartikdutt18 (Member Author)

Thanks @prince776, I will make the changes. Also, there is a typo in the data loader that I will fix in the next commit; I spent hours using valgrind to figure out why it segfaults.

Comment on lines 117 to 120
trainCSVData = dataset.submat(arma::span(),arma::span(0, (1 - ratio) *
dataset.n_cols));
testCSVData = dataset.submat(arma::span(), arma::span((1 - ratio) * dataset.n_cols,
dataset.n_cols - 1));
Member

data::Split() can be used here, without shuffling.
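
A minimal sketch of the suggestion; the caveat behind "without shuffling" is that, at the time of this thread, data::Split() shuffled the data, which is what the follow-up PR mentioned below relates to:

```cpp
#include <mlpack/core.hpp>
#include <mlpack/core/data/split_data.hpp>

// Split the columns of `dataset` into train and test portions with a
// 25% test ratio; Split() handles the index bookkeeping internally.
arma::mat dataset(10, 100, arma::fill::randu);
arma::mat trainData, testData;
mlpack::data::Split(dataset, trainData, testData, 0.25);
```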

Member Author

Yay! Thanks for the PR. I will hopefully push a new data loader by tonight.

Member

Raised #2293.

@kartikdutt18 (Member Author)

Hey @jeffin143, sorry to bother you again, but do you think the new data loader is better than what it initially was?
I haven't added comments yet, and I still have to add support for the Google stock prices dataset, but in general does it seem like a better idea?

@jeffin143 jeffin143 left a comment

WDYT?

Comment on lines +63 to +100

void LoadTrainCSV(const std::string &datasetPath,
                  const bool shuffle,
                  const double ratio = 0.75,
                  const bool useScaler = false,
                  const bool dropHeader = false,
                  const int startInputFeatures = -1,
                  const int endInputFeatures = -1,
                  const int startPredictionFeatures = -1,
                  const int endPredictionFeatures = -1,
                  const std::vector<std::string> augmentation =
                      std::vector<std::string>(),
                  const double augmentationProbability = 0.2);

void LoadTrainCSV(const std::string &datasetPath,
                  const double ratio = 0.75,
                  const int rho = 10,
                  const bool useScaler = false,
                  const bool dropHeader = false,
                  const int startInputFeatures = -1,
                  const int endInputFeatures = -1,
                  const size_t inputSize = 1,
                  const size_t outputSize = 1);

void LoadTestCSV(const std::string &datasetPath,
                 const bool useScaler = false,
                 const bool dropHeader = false,
                 const int startInputFeatures = -1,
                 const int endInputFeatures = -1);

void LoadTestCSV(const std::string &datasetPath,
                 const int rho = 10,
                 const bool useScaler = false,
                 const bool dropHeader = false,
                 const int startInputFeatures = -1,
                 const int endInputFeatures = -1,
                 const size_t inputSize = 1,
                 const size_t outputSize = 1);
Member

@kartikdutt18, I just took a quick glance; I guess we don't need all these multiple functions.

Maybe just use a LoadCSV() function, and if the user wants, he can call it twice:
LoadCSV(input, "mytest") -> for the test CSV
LoadCSV(input_train, "mytrain") -> for the train CSV

Or, if he doesn't have the data in split format, he can call the function to load the data and split it too:
LoadCSV(input, "data", split = true) -> for the train and test CSV

Not sure; this is a vague idea and I am thinking aloud, and I probably need to write it down. I was working on something else just now and hence didn't dwell on it much. Will take a look tonight.

Member Author

I can do that, but I would end up adding if conditions, which aren't clean; still, I get your point and I don't mind doing that. I think I can add another option, time_series = true / false, to remove the remaining overloaded functions.
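
Putting the two ideas together, a purely hypothetical unified signature might look like the following; none of these parameter names come from the PR, and the exact set of options is just a sketch:

```cpp
// Hypothetical single entry point replacing the four overloads:
// `split` controls whether a train/test split happens on load, and
// `timeSeries` (with `rho`) switches the output from matrices to
// look-back cubes.
void LoadCSV(const std::string& datasetPath,
             const bool split = true,
             const double ratio = 0.75,
             const bool timeSeries = false,
             const int rho = 10,
             const bool useScaler = false,
             const bool dropHeader = false,
             const int startInputFeatures = -1,
             const int endInputFeatures = -1);
```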

@kartikdutt18 kartikdutt18 mentioned this pull request Mar 14, 2020
@prince776 prince776 left a comment

Just some small problems I found while working with the models.

Review comments on models/lenet/lenet.hpp (two threads).
@kartikdutt18 (Member Author)

Closing this; it will be shifted to the new repo.

Successfully merging this pull request may close these issues:

Implementation of Tests? Addition of MobileNet and architecture to load certain datasets.

5 participants