
Add memory benchmarks #142

Open
braceletboy wants to merge 5 commits into master

Conversation

braceletboy (Contributor) commented Dec 16, 2019

Hi @zoq and @rcurtin, pardon me for contributing to this repository after a long break. The last time we talked on IRC, we discussed benchmarking the memory usage of the machine learning algorithms, so I have made some changes to the repository to do this, which I am presenting here. Some of the initial commits contain minor fixes; the main commits for memory benchmarking are c33047a and db9df17. Let me know what you think of these changes.

The f1ee3aa commit was made because running make setup doesn't install the packages directly and can throw errors if dependencies are not satisfied. Also, I don't think these dependencies apply to all kinds of Linux systems; mine is Ubuntu 18.04, and I have limited knowledge of other systems, so I can't speak about these dependencies confidently.

PS: This is not the final pull request. Feedback is needed.

Added all the competing libraries to the list in the README. Added
additional dependencies that are required for running 'make setup'
directly. This may not be an exhaustive listing of dependencies;
cross-checking on different types of platforms is required.
Changes to be committed:
	modified:   README.md
Changes to be committed:
	modified:   libraries/ann_install.sh
	modified:   libraries/annoy_install.sh
	modified:   libraries/dlibml_install.sh
	modified:   libraries/download_packages.sh
	modified:   libraries/dtimeout_install.sh
	modified:   libraries/flann_install.sh
	modified:   libraries/hlearn_install.sh
	modified:   libraries/install_all.sh
	modified:   libraries/milk_install.sh
	modified:   libraries/mlpack_install.sh
	modified:   libraries/mlpy_install.sh
	modified:   libraries/mrpt_install.sh
	modified:   libraries/nearpy_install.sh
	modified:   libraries/r_install.sh
	modified:   libraries/scikit_install.sh
	modified:   libraries/weka_install.sh
Shogun's install script installed the python3.5 interface for shogun. Now it has
been generalized to install for the system's python3 version.

Changes to be committed:
	modified:   libraries/shogun_install.sh
Changes to be committed:
	new file:   libraries/mprofiler_install.sh
	modified:   libraries/package-urls.txt
	modified:   libraries/install_all.sh
Changes to be committed:
	modified:   run.py
@@ -77,7 +78,13 @@ def run_timeout_wrapper():
logging.info('Run: %s' % (str(instance)))

# Run the metric method.
- result = instance.metric()
+ mem_use, result = memory_usage(instance.metric,
Member:

Does mprofiler track a process that is called with subprocess as well? Or does it only track the python process?

braceletboy (Contributor, Author):

Yes, it can track the processes spawned using subprocess. The include_children option here in the function is specifically for this purpose. I have also gone through the source code of memory_profiler a bit, so I am sure of it. I will leave a link to it here.
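
For illustration, a minimal sketch of the idea (the binary invocation and dataset path are made up; this is not code from this PR):

# Minimal sketch: with include_children=True, memory_usage also accounts for
# the memory of processes spawned via subprocess, not just the Python process.
import subprocess
from memory_profiler import memory_usage

def run_binary():
    # Child process whose peak memory should be attributed to this call.
    subprocess.check_call(['mlpack_kfn', '-r', 'dataset.csv', '-k', '3', '-v'])

mem_use, _ = memory_usage((run_binary, (), {}), max_usage=True,
                          retval=True, include_children=True)
print('peak memory (MiB):', mem_use)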

zoq (Member) commented Dec 18, 2019

Thanks for looking into this, excited to see some first results.

braceletboy (Contributor, Author) commented Dec 19, 2019

@zoq So, I have some questions regarding the memory benchmarking system.

  1. The memory_usage function returns the max memory usage (when the max_usage argument is True) of the current process we are benchmarking and also of its children (when the include_children argument is True). In this we take into account the memory assigned to the process as a whole, which might include the memory assigned by other parts of the program (the benchmarking support structure in run.py) as well as the memory used for the data. So, I have two sub-questions in this regard:

a) Should the memory usage for the data be included?
b) Should the memory usage for the support structure be included? (This seems like an obvious no. During some initial analysis, I used psutil's memory_info function to measure the memory in use just before the benchmark script is loaded in run.py, roughly as in the sketch below. It seems to vary from one ML algorithm to another, but it seems to be considerable, around 30-60 MB.)

  2. After some careful analysis of the memory_profiler code, I came to understand that the memory_usage function benchmarks the memory by spawning a child process that tracks the memory of the parent process using psutil's memory_info function at regular intervals. This might mean that when we use the memory_usage function with include_children=True, we are also including the memory used by this child process that does the tracking. Will this be an issue?
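
For reference, the baseline check mentioned in (b) was roughly this (a minimal sketch; the exact placement in run.py is assumed):

# Rough sketch: how much memory run.py's support structure already holds just
# before the benchmark script is loaded.
import os
import psutil

baseline_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 ** 2)
print('memory in use before loading the benchmark script: %.1f MiB' % baseline_mb)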

zoq (Member) commented Dec 21, 2019

a) Should the memory usage for the data be included?

For the runtime benchmarks, we do not include data loading as the focus is on the method runtime; I think for the memory benchmarks we should do the same.

For the data loading part, some libs support different formats (like loading binary data), which are often faster, and some libs do data encoding which isn't part of the method itself, so to do a fair comparison it makes sense to exclude the data loading part.

b) Should the memory usage for the support structure be included?

If we can avoid that, which I think we can if we put the memory benchmark inside the method script, we should do that.
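
Roughly something like this, maybe (just a sketch with invented names, not the existing benchmark scripts):

# Sketch only: measure memory inside the method script itself, so run.py's
# support structure is not counted and data loading is excluded.
import numpy as np
from memory_profiler import memory_usage

def metric(dataset_path):
    data = np.genfromtxt(dataset_path, delimiter=',')  # loading: not measured

    def algorithm():
        # Placeholder for the actual method (e.g. building trees, running KNN).
        return data.dot(data.T)

    mem_use, _ = memory_usage((algorithm, (), {}), max_usage=True,
                              retval=True, include_children=True)
    return {'memory': mem_use}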

After some careful analysis of the memory_profiler code, I came to understand that the memory_usage function benchmarks the memory by spawning a child process that tracks the memory of the parent process using psutil's memory_info function at regular intervals. This might mean that when we use the memory_usage function with include_children=True, we are also including the memory used by this child process that does the tracking. Will this be an issue?

Actually, I think we want to track child processes as well, to track methods that split the process into multiple children.

braceletboy (Contributor, Author) commented Dec 22, 2019

@zoq

For the runtime benchmarks, we do not include data loading as the focus is on the method runtime; I think for the memory benchmarks we should do the same.

I felt the same initially, but then I realized that when the benchmark scripts use subprocess, I had no way to separate the memory used for data loading from the memory used for the actual algorithm. See line 47 and line 66 of mlpack's allkfn benchmark; the data only gets loaded inside the subprocess that runs the mlpack_kfn binary with the given configured options. The same holds for MATLAB, R, ann, dlibml, flann - whose scripts use subprocess.

So,
a) Do you think there is a way to separate the memory used for data loading and the memory used for the algorithm when we use subprocess?
b) Also, isn't it unfair to these libraries that the data loading gets included in their runtime benchmark?

Actually, I think we want to track child processes as well, to track methods that split the process into multiple children.

Cool.

zoq (Member) commented Dec 22, 2019

I felt the same initially, but then I realized that when the benchmark scripts use subprocess, I had no way to separate the memory used for data loading from the memory used for the actual algorithm. See line 47 and line 66 of mlpack's allkfn benchmark; the data only gets loaded inside the subprocess that runs the mlpack_kfn binary with the given configured options. The same holds for MATLAB, R, ann, dlibml, flann - whose scripts use subprocess.

I see, that is tricky. In the past I used valgrind massif to track the memory consumption, so maybe we can do something similar? That would include the data loading part as well, but a user could check the results and manually filter out the data loading part. On second thought, I think I even like to include the data loading part, as we want to show what amount of memory is used by a specific method; each lib has to hold the data somehow, the format shouldn't matter, and if some lib uses e.g. a sparse matrix we should account for that.

Also, isn't it unfair to these libraries that the data loading gets included in their runtime benchmark?

We time the data loading/saving part and subtract it from the overall runtime; see:

metric["runtime"] = timer["total_time"] - timer["loading_data"] - timer["saving_data"]

for an example. We do the same for MATLAB, R, etc.

braceletboy (Contributor, Author) commented Jan 13, 2020

I see, that is tricky. In the past I used valgrind massif to track the memory consumption, so maybe we can do something similar? That would include the data loading part as well, but a user could check the results and manually filter out the data loading part. On second thought, I think I even like to include the data loading part, as we want to show what amount of memory is used by a specific method; each lib has to hold the data somehow, the format shouldn't matter, and if some lib uses e.g. a sparse matrix we should account for that.

So we have the following options:

  1. Track the memory consumption in such a way that we can separate out the data loading part and give the user the option of including it. I like this idea, but it remains to be seen how this can be done. I will look into valgrind massif (a rough sketch of what that might look like is below).
  2. Include all the memory consumption in one number. This would mean that the above code doesn't need any changes? Let me know :)
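
Something along these lines, perhaps (a hypothetical sketch, not tested; the mlpack_kfn invocation is just an example):

# Hypothetical sketch: wrap a method's command line in valgrind massif and read
# the peak heap size from the massif output file.
import re
import subprocess

def massif_peak_heap_bytes(cmd, out_file='massif.out'):
    subprocess.check_call(['valgrind', '--tool=massif',
                           '--massif-out-file=' + out_file] + cmd)
    # Massif snapshots record the heap size in 'mem_heap_B=<bytes>' lines.
    with open(out_file) as f:
        return max(int(m.group(1)) for m in re.finditer(r'mem_heap_B=(\d+)', f.read()))

peak = massif_peak_heap_bytes(['mlpack_kfn', '-r', 'dataset.csv', '-k', '3', '-v'])
print('peak heap usage: %d bytes' % peak)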

We time the data loading/saving part and subtract it from the overall runtime; see:

metric["runtime"] = timer["total_time"] - timer["loading_data"] - timer["saving_data"]

for an example. We do the same for MATLAB, R, etc.

That's great. I didn't see that line. :)

zoq (Member) commented Jan 13, 2020

Personally I would go with option 2, as holding the data is part of how a method performs memory-wise. We might want to take a look into massif anyway, as it might provide some more useful information. Let me know what you think.
