
Fix [LB-68] Memory Leak | Scraper #72

Closed

Conversation

@pinkeshbadjatiya (Contributor)

LB-68 - importer uses lots of RAM
The GC takes its own time to clean up the xhr variables captured in closures, which is slower than the rate at which requests are made. By manually deleting the reference, RAM usage is reduced.
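A minimal sketch of the pattern (not the actual importer code; the function and endpoint names are made up): the submit function drops its local reference to the request object as soon as the request is sent, so the closure no longer keeps it reachable. The description above says "deleting" the reference; as the discussion below shows, assigning null is the variant that actually affects a local variable, so that is what the sketch uses.

    function submitPage(payload, onDone) {
        var xhr = new XMLHttpRequest();
        xhr.open("POST", "/submit");             // placeholder endpoint, not the real scraper's
        xhr.onload = function () {
            onDone(this.status);                 // use `this` so the callback doesn't reference `xhr`
        };
        xhr.send(JSON.stringify(payload));
        xhr = null;                              // drop the local reference; the browser keeps the
                                                 // in-flight request alive until it completes
    }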

@gentlecat (Contributor)

Cool. Just fix the indentation please.

@pinkeshbadjatiya (Contributor, Author)

I really need to work on my indentation. :)

@gentlecat (Contributor)

I suggest using an additional tool that can check all this and more for you. I don't know which editor you use, but take a look at pylint.

@mwiencek (Member) commented Mar 7, 2016

I don't see how this could do anything, because you can't delete local variables in JavaScript, only properties of an object: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/delete. At best you can assign null to the variable.
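For illustration (not from the PR), delete only frees a reference when it removes a property from an object; a var-declared variable is left untouched:

    var holder = { xhr: new XMLHttpRequest() };
    console.log(delete holder.xhr);   // true: the property is removed, the object no longer holds the XHR
    console.log("xhr" in holder);     // false

    var local = 42;
    console.log(delete local);        // false in non-strict code: var-declared variables can't be deleted
    console.log(local);               // 42 -- still reachable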

@pinkeshbadjatiya (Contributor, Author)

My earlier implementation was assigning null, but I read somewhere that delete can also do the job. The idea is to reduce the reference count.

I will try some tests, and otherwise revert the changes. :)

@pinkeshbadjatiya (Contributor, Author)

I did some tests using the memory profiler in Chrome, for three scripts: original_script, script_with_null, script_with_delete.

  • The amount of memory used remained the same across all three, as expected.
  • There was not much change in the amount of memory cleaned by the GC, but memory was being cleaned at a slightly faster pace than in the original implementation when using null.
  • delete did not change a thing.

I am in favour of using null.

@mwiencek (Member)

You could've just tried delete in the console to see that it doesn't do anything. ;)

> (function () { var foo = {a: 1}; delete foo; console.log(foo); }());
Object {a: 1} // foo is still referenceable, so can't possibly have been GC'd

Git displays 1 tab as 8 spaces, while the LB code is written with 1 tab = 4 spaces and tab expansion enabled.
Fixed my styling issues.
@pinkeshbadjatiya (Contributor, Author)

Ready for merge. :)

@alastair (Collaborator)

How many pages did you try to import? We have reports of it using lots of memory for users with over 6000 pages of data. It'd be good if you could test it on this too. Can you show screenshots of memory usage with and without the = null?

Are you also able to measure the usage using Firefox?

@pinkeshbadjatiya (Contributor, Author)

I used the Google Chrome memory profiler, and I tested it on the playlist http://www.last.fm/user/Self-MadeWolf/library/tracks

I tried with Firefox but could not get the hang of it. :(

@pinkeshbadjatiya (Contributor, Author)

I have set up a page to test the memory leak.
Details can be found here.

I will post the screenshots soon, but anyone else can also test and comment on their experience.

@pinkeshbadjatiya (Contributor, Author)

@alastair
Here are some of the test results.

  1. Default XHR requests
  2. With = null
  3. Reusing XHRs by using an XHRPool of size 10

The above images show the frame rates during a test of submitting 500 pages each from the playlist nasasie.
Inference: there is no hugely significant difference, but case 2 does give thicker and more continuous frame rates. As for case 3, I am still not sure what happened! I will need some expert advice. :)


Complete result dump can be found at https://github.com/pinkeshbadjatiya/memory-leak-test-ListenBrainz/tree/gh-pages/testresults

@kpmcc commented Mar 21, 2016

This is kind of a wild guess from looking at the code a bit and thinking of other ways to help with memory, but the recursive calls within reportScrobbles (in xhr.onload, line 177) look like they might use a fair bit of memory if a bunch of the requests come back with errors or otherwise fail consecutively. I'm not sure what the GC does with all that, but it might not get to the part where xhr is set to null until a successful request goes through.

@mwiencek (Member)

That's not actually recursive, because onload etc. are executed asynchronously in entirely different stack frames. xhr = null; is always reached, even before the request completes.
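A small illustration of that point (assumed code, not the scraper's actual implementation): the onload handler is invoked later by the event loop in a fresh stack frame, so re-invoking the submit function from it never deepens the original call stack, and the xhr = null; line runs immediately after send().

    function reportScrobbles(page, lastPage) {
        var xhr = new XMLHttpRequest();
        xhr.open("POST", "/submit");                 // placeholder endpoint
        xhr.onload = function () {
            if (page < lastPage) {
                reportScrobbles(page + 1, lastPage); // queued by the event loop, not a nested call
            }
        };
        xhr.send();
        xhr = null;                                  // reached right away, before any response arrives
    }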

@alastair (Collaborator) commented Apr 5, 2016

I spent some time digging into this with the Chrome memory profiler. Here is what I found:

  • Using the "Profiles" tab, after finishing an upload of 320 pages, I had about 640 XMLHttpRequest objects which had not been garbage collected. In addition to this were a bunch of arrays and other objects containing the bodies sent to LB.
  • Adding xhr = null; made no difference. There were still 2 XMLHttpRequest objects for each last.fm page.
  • Using the "Timeline" tab to measure memory usage showed an increasing number of event listeners on the page, which were not being GCd. We looked into http://nullprogram.com/blog/2013/02/08/, which describes some problems with closures and GC. To try and fix this we moved all xhr callbacks to predefined functions. This didn't fix our problem; we managed to reduce the listener count somewhat, but it kept growing.
  • We had a suggestion to try using promises instead of event listeners, although the person who suggested it had a suspicion that this would have the same problem as the event listeners. He also suggested using the new fetch API (https://fetch.spec.whatwg.org/), though this is only supported in Chrome and Firefox, and the shims available for IE and Safari (https://github.com/github/fetch) are just wrappers around XMLHttpRequest and promises, which puts us back at square one.
  • We had a suggestion to not have the readpage->submit->readpage loop, and instead to call readpage from a timer, only running when we had finished submitting the previous pages:
    var timer = setInterval(function() {
        if (page > numberOfPages) {
            clearInterval(timer);
        }
        getNextPagesIfSlots();
    }, 100);
    This didn't seem to work.
  • Finally, we applied an xhr pool (in addition to all of the above changes), using this example: http://pathfindersoftware.com/2006/08/object_pooling_/. This worked much better. Now, after performing the import, we have only 12 XHR objects remaining (the pool example creates more on demand if it's exhausted, but since we limit ourselves to 10 requests at once this doesn't happen very often); a minimal sketch of such a pool follows this list.
    With this change, no body payload objects were remaining either. The heap history on the Profiles tab also showed that much more data was being GCd. When importing 500 pages, Chrome's memory usage dropped from 500MB (no pool) to 300MB (pool), from a starting point of 200MB. There is still a memory leak somewhere, but we're starting to think it might be in code that's hosted on last.fm's site, so we can't do much about it. Moving to the API scraper should stop this from happening.
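A minimal sketch of an XHR pool along those lines (assumed names and size; this is not the ListenBrainz implementation): requests reuse a small, fixed set of XMLHttpRequest objects instead of allocating a new one per page.

    // Hypothetical pool, loosely following the object-pooling example linked above.
    var xhrPool = {
        size: 10,
        free: [],
        acquire: function () {
            // Reuse an idle request object if one exists, otherwise create a new one.
            return this.free.pop() || new XMLHttpRequest();
        },
        release: function (xhr) {
            // Keep at most `size` idle objects around for reuse.
            if (this.free.length < this.size) {
                this.free.push(xhr);
            }
        }
    };

    function submitPage(payload, done) {
        var xhr = xhrPool.acquire();
        xhr.onload = function () {
            done(this.status);
            xhrPool.release(this);   // hand the object back instead of leaving it for the GC
        };
        xhr.open("POST", "/submit"); // placeholder endpoint
        xhr.send(JSON.stringify(payload));
    }

Because the importer limits itself to 10 concurrent requests, a pool like this rarely has to grow past its initial size, which fits the handful of surviving XHR objects reported above.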

I want to do a few more things here:

  • Undo some of the changes we made (timer, listeners in functions not closures), but keep the pool and see if the memory usage stays the same.
  • See how many pages we can get to before things start dying, and see if this is higher than previous reports.
  • Work out why the number of event listeners keeps climbing.
  • Do this profiling in the branch that queries the last.fm API instead of scraping. If there is a problem with running stuff on the last.fm website, we should see these leaks go away once we start running it on lb.org instead.

@pinkeshbadjatiya deleted the memory-leak-scraper branch April 5, 2016 17:11
paramsingh added a commit that referenced this pull request Jan 19, 2020
* Add first test for request consumer

* Use python -m pytest instead of py.test

py.test doesn't add the current dir to PYTHONPATH which is why
if you try to add a test anywhere other than listenbrainz_spark/tests
you'll get an import error. this fixes that problem.

* fix definition of self

* Import create_app

* Add test for get_result
paramsingh added a commit that referenced this pull request Jan 19, 2020
* add tests for create_dataframes
modify utils to avoid prepending hdfs_cluster_uri to every path

* unit tests for create_dataframes

* unit tests for train models

* Add tests for request consumer and fix test.sh path problems (#72)

* Add first test for request consumer

* Use python -m pytest instead of py.test

py.test doesn't add the current dir to PYTHONPATH which is why
if you try to add a test anywhere other than listenbrainz_spark/tests
you'll get an import error. this fixes that problem.

* fix definition of self

* Import create_app

* Add test for get_result

* declared constant var as a class member
changed class names for consistency
changed test_get_dates... to assert difference
used assertListEqual for lists
removed wildcard imports
typos and newlines

* unit tests for candidate_sets.py and recommend.py

* defined date as an instance variable

Co-authored-by: Param Singh <iliekcomputers@gmail.com>
vansika added a commit that referenced this pull request Jan 30, 2020
* used mbids->msids mapping in create_dataframes
used pyspark API over sql queries

* fixed bad indent

* Use combined mapping (recording_artist_msid_mbid) and not recording and artist mapping independently

* Use artist_credit_artist_credit relation and recording_artist_mbid_msid mapping to generate recommendations

* vertical align pyspark functions to increase readability

* Update utils.py

* Unit test listenbrainz_spark/recommendations (#74)

* add tests for create_dataframes
modify utils to avoid prepending hdfs_cluster_uri to every path

* unit tests for create_dataframes

* unit tests for train models

* Add tests for request consumer and fix test.sh path problems (#72)

* Add first test for request consumer

* Use python -m pytest instead of py.test

py.test doesn't add the current dir to PYTHONPATH which is why
if you try to add a test anywhere other than listenbrainz_spark/tests
you'll get an import error. this fixes that problem.

* fix definition of self

* Import create_app

* Add test for get_result

* declared constant var as a class member
changed class names for consistency
changed test_get_dates... to assert difference
used assertListEqual for lists
removed wildcard imports
typos and newlines

* unit tests for candidate_sets.py and recommend.py

* defined date as an instance variable

Co-authored-by: Param Singh <iliekcomputers@gmail.com>

* upload mappings, artist relation and listens to HDFS (#77)

* download listens and mapping from FTP
upload listens and mapping to HDFS

* calculate time taken to download files from ftp

* correct listenbrainz listens dump path on ftp

* add missing import and remove extra func arg

* upload and download script for artist relation

* add *force* utility to delete existing data

* rectify name of imports, delete unused import files

* update tests and recommendation engine with mdis_mbid_mapping

* use NotImplementedException to catch null callback

* Add archival warning

* Unit tests for HDFS/FTP module (#698)

* download listens and mapping from FTP
upload listens and mapping to HDFS

* calculate time taken to download files from ftp

* correct listenbrainz listens dump path on ftp

* add missing import and remove extra func arg

* upload and download script for artist relation

* add *force* utility to delete existing data

* rectify name of imports, delete unused import files

* update tests and recommendation engine with mdis_mbid_mapping

* use NotImplementedException to catch null callback

* tests for ftp downloader and hdfs uploader for mapping, listens, artist relations

* update func name in utils and add test for it

* Fix startup script of spark-request-consumer

* add pxz.wait to avoid race condition
modified utils.create_dataframe to wrap spark row object in list

* improve function names

* define constants on top of file and import them in tests

Co-authored-by: Param Singh <iliekcomputers@gmail.com>

* Attributes for candidate recording dumps (#710)

* get attributes for candidate recordings dump

* update unit tests with changes in recommed.py

* typo in function name

* import utils as module and not attribute

Co-authored-by: Param Singh <iliekcomputers@gmail.com>