
Fix [LB-68] Memory Leak | Scraper #72

Closed

Conversation

@pinkeshbadjatiya (Contributor)

LB-68 - importer uses lots of RAM
The GC takes its own time to clean up the xhr variables captured in closures, which is slower than the rate at which requests are made. By manually deleting the reference, RAM usage is reduced.
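A minimal sketch of the pattern (not the actual importer code; the function and endpoint names are made up): the submit function drops its local reference to the request object as soon as the request is sent, so the closure no longer keeps it reachable. The description above says "deleting" the reference; as the discussion below shows, assigning null is the variant that actually affects a local variable, so that is what the sketch uses.

    function submitPage(payload, onDone) {
        var xhr = new XMLHttpRequest();
        xhr.open("POST", "/submit");             // placeholder endpoint, not the real scraper's
        xhr.onload = function () {
            onDone(this.status);                 // use `this` so the callback doesn't reference `xhr`
        };
        xhr.send(JSON.stringify(payload));
        xhr = null;                              // drop the local reference; the browser keeps the
                                                 // in-flight request alive until it completes
    }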

@gentlecat (Contributor)

Cool. Just fix the indentation please.

@pinkeshbadjatiya (Contributor, Author)

I really need to work on my indentation. :)

@gentlecat (Contributor)

I suggest using an additional tool that can check all this and more for you. I don't know which editor you use, but take a look at pylint.

@mwiencek (Member) commented Mar 7, 2016

I don't see how this could do anything, because you can't delete local variables in JavaScript, only properties of an object: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/delete. At best you can assign null to the variable.
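For illustration (not from the PR), delete only frees a reference when it removes a property from an object; a var-declared variable is left untouched:

    var holder = { xhr: new XMLHttpRequest() };
    console.log(delete holder.xhr);   // true: the property is removed, the object no longer holds the XHR
    console.log("xhr" in holder);     // false

    var local = 42;
    console.log(delete local);        // false in non-strict code: var-declared variables can't be deleted
    console.log(local);               // 42 -- still reachable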

@pinkeshbadjatiya (Contributor, Author)

My earlier implementation was assigning null, but I read somewhere that delete can also do the job. The idea is to reduce the reference count.

I will try some tests, and otherwise revert the changes. :)

@pinkeshbadjatiya (Contributor, Author)

I did some tests using the memory profiler in Chrome, for three scripts: original_script, script_with_null, script_with_delete.

  • The amount of memory used remained the same across all three, as expected.
  • There was not much change in the amount of memory cleaned by the GC, but memory was being cleaned at a slightly faster pace than in the original implementation when using null.
  • delete did not change a thing.

I am in favour of using null.

@mwiencek (Member)

You could've just tried delete in the console to see that it doesn't do anything. ;)

> (function () { var foo = {a: 1}; delete foo; console.log(foo); }());
Object {a: 1} // foo is still referenceable, so can't possibly have been GC'd

Git displays 1 tab as 8 spaces, while the LB code is written with 1 tab = 4 spaces and tab expansion enabled.
Fixed my styling issues.
@pinkeshbadjatiya (Contributor, Author)

Ready for merge. :)

@alastair (Collaborator)

How many pages did you try to import? We have reports of it using lots of memory for users with over 6000 pages of data. It'd be good if you could test it on this too. Can you show screenshots of memory usage with and without the = null?

Are you also able to measure the usage using Firefox?

@pinkeshbadjatiya (Contributor, Author)

I used the Google Chrome memory profiler, and I tested it on the playlist http://www.last.fm/user/Self-MadeWolf/library/tracks

I tried with Firefox but could not get the hang of it. :(

@pinkeshbadjatiya (Contributor, Author)

I have set up a page to test the memory leak.
Details can be found here.

I will post the screenshots soon, but anyone else can also test and comment on their experience.

@pinkeshbadjatiya (Contributor, Author)

@alastair
Here are some of the test results.

  1. Default XHR requests
  2. With = null
  3. Reusing XHRs by using an XHRPool of size 10

The above images show the frame rates during a test of submitting 500 pages each from the playlist nasasie.
Inference: there is no hugely significant difference, but case 2 does give thicker and more continuous frame rates. As for case 3, I am still not sure what happened! I will need some expert advice. :)


Complete result dump can be found at https://github.com/pinkeshbadjatiya/memory-leak-test-ListenBrainz/tree/gh-pages/testresults

@kpmcc commented Mar 21, 2016

This is kind of a wild guess from looking at the code a bit and thinking of other ways to help with memory, but the recursive calls within reportScrobbles (in xhr.onload, line 177) look like they might use a fair bit of memory if a bunch of the requests come back with errors or otherwise fail consecutively. I'm not sure what the GC does with all that, but it might not get to the part where xhr is set to null until a successful request goes through.

@mwiencek (Member)

That's not actually recursive, because onload etc. are executed asynchronously in entirely different stack frames. xhr = null; is always reached, even before the request completes.
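A small illustration of that point (assumed code, not the scraper's actual implementation): the onload handler is invoked later by the event loop in a fresh stack frame, so re-invoking the submit function from it never deepens the original call stack, and the xhr = null; line runs immediately after send().

    function reportScrobbles(page, lastPage) {
        var xhr = new XMLHttpRequest();
        xhr.open("POST", "/submit");                 // placeholder endpoint
        xhr.onload = function () {
            if (page < lastPage) {
                reportScrobbles(page + 1, lastPage); // queued by the event loop, not a nested call
            }
        };
        xhr.send();
        xhr = null;                                  // reached right away, before any response arrives
    }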

@alastair (Collaborator) commented Apr 5, 2016

I spent some time digging into this with the Chrome memory profiler. Here is what I found:

  • Using the "Profiles" tab, after finishing an upload of 320 pages, I had about 640 XMLHttpRequest objects which had not been garbage collected. In addition to this were a bunch of arrays and other objects containing the bodies sent to LB.
  • Adding xhr = null; made no difference. There were still 2 XMLHttpRequest objects for each last.fm page.
  • Using the "Timeline" tab to measure memory usage showed an increasing number of event listeners on the page, which were not being GCd. We looked into http://nullprogram.com/blog/2013/02/08/, which describes some problems with closures and GC. To try and fix this we moved all xhr callbacks to predefined functions. This didn't fix our problem; we managed to reduce the listener count somewhat, but it kept growing.
  • We had a suggestion to try using promises instead of event listeners, although the person who suggested it had a suspicion that this would have the same problem as the event listeners. He also suggested using the new fetch API (https://fetch.spec.whatwg.org/), though this is only supported in Chrome and Firefox, and the shims available for IE and Safari (https://github.com/github/fetch) are just wrappers around XMLHttpRequest and promises, which puts us back at square one.
  • We had a suggestion to not have the readpage->submit->readpage loop, and instead to call readpage from a timer, only running when we had finished submitting the previous pages:
    var timer = setInterval(function() {
        if (page > numberOfPages) {
            clearInterval(timer);
        }
        getNextPagesIfSlots();
    }, 100);
    This didn't seem to work.
  • Finally, we applied an xhr pool (in addition to all of the above changes), using this example: http://pathfindersoftware.com/2006/08/object_pooling_/. This worked much better. Now, after performing the import, we have only 12 XHR objects remaining (the pool example creates more on demand if it's exhausted, but since we limit ourselves to 10 requests at once this doesn't happen very often); a minimal sketch of such a pool follows this list.
    With this change, no body payload objects were remaining either. The heap history on the Profiles tab also showed that much more data was being GCd. When importing 500 pages, Chrome's memory usage dropped from 500MB (no pool) to 300MB (pool), from a starting point of 200MB. There is still a memory leak somewhere, but we're starting to think it might be in code that's hosted on last.fm's site, so we can't do much about it. Moving to the API scraper should stop this from happening.
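A minimal sketch of an XHR pool along those lines (assumed names and size; this is not the ListenBrainz implementation): requests reuse a small, fixed set of XMLHttpRequest objects instead of allocating a new one per page.

    // Hypothetical pool, loosely following the object-pooling example linked above.
    var xhrPool = {
        size: 10,
        free: [],
        acquire: function () {
            // Reuse an idle request object if one exists, otherwise create a new one.
            return this.free.pop() || new XMLHttpRequest();
        },
        release: function (xhr) {
            // Keep at most `size` idle objects around for reuse.
            if (this.free.length < this.size) {
                this.free.push(xhr);
            }
        }
    };

    function submitPage(payload, done) {
        var xhr = xhrPool.acquire();
        xhr.onload = function () {
            done(this.status);
            xhrPool.release(this);   // hand the object back instead of leaving it for the GC
        };
        xhr.open("POST", "/submit"); // placeholder endpoint
        xhr.send(JSON.stringify(payload));
    }

Because the importer limits itself to 10 concurrent requests, a pool like this rarely has to grow past its initial size, which fits the handful of surviving XHR objects reported above.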

I want to do a few more things here:

  • Undo some of the changes we made (timer, listeners in functions not closures), but keep the pool and see if the memory usage stays the same.
  • See how many pages we can get to before things start dying, and see if this is higher than previous reports.
  • Work out why the number of event listeners keeps climbing.
  • Do this profiling in the branch that queries the last.fm API instead of scraping. If there is a problem with running stuff on the last.fm website, we should see these leaks go away once we start running it on lb.org instead.

@pinkeshbadjatiya deleted the memory-leak-scraper branch April 5, 2016 17:11
paramsingh added a commit that referenced this pull request Jan 19, 2020
* Add first test for request consumer

* Use python -m pytest instead of py.test

py.test doesn't add the current dir to PYTHONPATH which is why
if you try to add a test anywhere other than listenbrainz_spark/tests
you'll get an import error. this fixes that problem.

* fix definition of self

* Import create_app

* Add test for get_result
paramsingh added a commit that referenced this pull request Jan 19, 2020
* add tests for create_dataframes
modify utils to avoid prepending hdfs_cluster_uri to every path

* unit tests for create_dataframes

* unit tests for train models

* Add tests for request consumer and fix test.sh path problems (#72)

* Add first test for request consumer

* Use python -m pytest instead of py.test

py.test doesn't add the current dir to PYTHONPATH which is why
if you try to add a test anywhere other than listenbrainz_spark/tests
you'll get an import error. this fixes that problem.

* fix definition of self

* Import create_app

* Add test for get_result

* declared constant var as a class member
changed class names for consistency
changed test_get_dates... to assert difference
used assertListEqual for lists
removed wildcard imports
typos and newlines

* unit tests for candidate_sets.py and recommend.py

* defined date as an instance variable

Co-authored-by: Param Singh <iliekcomputers@gmail.com>
vansika added a commit that referenced this pull request Jan 30, 2020
* used mbids->msids mapping in create_dataframes
used pyspark API over sql queries

* fixed bad indent

* Use combined mapping (recording_artist_msid_mbid) and not recording and artist mapping independently

* Use artist_credit_artist_credit relation and recording_artist_mbid_msid mapping to generate recommendations

* vertical align pyspark functions to increase readability

* Update utils.py

* Unit test listenbrainz_spark/recommendations (#74)

* add tests for create_dataframes
modify utils to avoid prepending hdfs_cluster_uri to every path

* unit tests for create_dataframes

* unit tests for train models

* Add tests for request consumer and fix test.sh path problems (#72)

* Add first test for request consumer

* Use python -m pytest instead of py.test

py.test doesn't add the current dir to PYTHONPATH which is why
if you try to add a test anywhere other than listenbrainz_spark/tests
you'll get an import error. this fixes that problem.

* fix definition of self

* Import create_app

* Add test for get_result

* declared constant var as a class member
changed class names for consistency
changed test_get_dates... to assert difference
used assertListEqual for lists
removed wildcard imports
typos and newlines

* unit tests for candidate_sets.py and recommend.py

* defined date as an instance variable

Co-authored-by: Param Singh <iliekcomputers@gmail.com>

* upload mappings, artist relation and listens to HDFS (#77)

* download listens and mapping from FTP
upload listens and mapping to HDFS

* calculate time taken to download files from ftp

* correct listenbrainz listens dump path on ftp

* add missing import and remove extra func arg

* upload and download script for artist relation

* add *force* utility to delete existing data

* rectify name of imports, delete unused import files

* update tests and recommendation engine with mdis_mbid_mapping

* use NotImplementedException to catch null callback

* Add archival warning

* Unit tests for HDFS/FTP module (#698)

* download listens and mapping from FTP
upload listens and mapping to HDFS

* calculate time taken to download files from ftp

* correct listenbrainz listens dump path on ftp

* add missing import and remove extra func arg

* upload and download script for artist relation

* add *force* utility to delete existing data

* rectify name of imports, delete unused import files

* update tests and recommendation engine with mdis_mbid_mapping

* use NotImplementedException to catch null callback

* tests for ftp downloader and hdfs uploader for mapping, listens, artist relations

* update func name in utils and add test for it

* Fix startup script of spark-request-consumer

* add pxz.wait to avoid race condition
modified utils.create_dataframe to wrap spark row object in list

* improve function names

* define constants on top of file and import them in tests

Co-authored-by: Param Singh <iliekcomputers@gmail.com>

* Attributes for candidate recording dumps (#710)

* get attributes for candidate recordings dump

* update unit tests with changes in recommed.py

* typo in function name

* import utils as module and not attribute

Co-authored-by: Param Singh <iliekcomputers@gmail.com>