Big memory leak when adding / updating / importing products #2563

Open · Tracked by #8764
stephanegigandet opened this issue Nov 8, 2019 · 18 comments
Labels: 🐛 bug · Data import · Fixed ? · needs investigation · P2 · 🚅 Performance · 🏭 Producers Platform (https://wiki.openfoodfacts.org/Platform_for_producers)



stephanegigandet commented Nov 8, 2019

There is a big memory leak when we update a product. It can be seen in particular when we launch the import_csv_file.pl script with a large number of products (e.g. 1000): the process keeps growing until it runs out of memory.

For a real product import from a big producer (e.g. 5000 products), the process grows to 10 GB of memory.

There are many different things that happen when a product is updated: the ingredients lists are processed, allergens and additives are detected, quality is measured, etc. Each of those steps involves calling many different subroutines, in particular many functions from Tags.pm related to taxonomies.

At this point, we do not know what exact part of the pre-saving processing causes the memory leak (it is very likely that several parts contribute to it).

This needs further investigation, maybe by testing each part individually (e.g. running each function tens of thousands of times and measuring memory), and/or with tools like Devel::Gladiator etc.
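
For instance, a minimal sketch of that per-step measurement using Devel::Gladiator (the loop body is a placeholder for whichever pre-save step is under test):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Devel::Gladiator qw(arena_ref_counts);

# Total number of live SVs in the interpreter's arena.
sub arena_total {
    my $counts = arena_ref_counts();    # hashref of SV type => count
    my $total  = 0;
    $total += $_ for values %$counts;
    return $total;
}

my $baseline = arena_total();
for my $i (1 .. 10_000) {
    # placeholder: call one processing step here on a fixed input,
    # e.g. a taxonomy lookup from Tags.pm
    if ($i % 1_000 == 0) {
        printf "after %5d iterations: %d extra live SVs\n",
            $i, arena_total() - $baseline;
    }
}
```

If a step leaks, the extra-SV count should keep climbing instead of plateauing after warm-up.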

Steps to see the memory leak:

  1. Use a CSV file of product data that looks like this:

```
code	lc	product_name_en	ingredients_text_en
7000000000000	en	Test product	Flour, salt, sugar, cranberries, water, lemon, orange, strawberries, pears, grapes
7000000000001	en	Test product	Flour, salt, sugar, cranberries, water, lemon, orange, strawberries, pears, grapes
7000000000002	en	Test product	Flour, salt, sugar, cranberries, water, lemon, orange, strawberries, pears, grapes
[..] (with thousands of products)
```

  2. In /srv/off/scripts/, run:

```
./import_csv_file.pl --user_id debug --org debug --source_id debug --source_name debug --source_url debug --define lc=en --csv_file debug-code-product_name-ingredients2.csv
```

debug-code-product_name-ingredients2.csv.zip

Related discussions/issues: #2053 #2054

This issue is very important to resolve, because it also happens (more gradually) in production on the web site: whenever a product is updated, the Apache mod_perl process grows, and eventually it needs to be restarted.

Part of #8764.

@stephanegigandet stephanegigandet added 🐛 bug This is a bug, not a feature request. 🎯 P1 candidate labels Nov 8, 2019

zigouras commented Nov 10, 2019

I've spent some time trying to profile the product import Perl script, and tried a number of memory profiling packages.
I was able to get output from Devel::Gladiator, but it was hard to understand in some parts and simply not useful in others. I tried Devel::Leak and Test::Memory::Leak, and the process just hung and did not return anything.

In general, I would recommend having as few exported, package-global variables shared between modules as possible. I noticed a lot of global scoping and exports, which could be a concern: package globals stay alive for the lifetime of the process, whereas lexical variables inside a subroutine are garbage-collected once the subroutine is done executing (as long as nothing else holds a reference to them).
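
A minimal illustration of the difference (the module and variable names are made up):

```perl
package ProductOpener::Example;    # hypothetical module, for illustration
use strict;
use warnings;

# Package global: stays alive for the whole process (or mod_perl worker),
# so anything stored in it accumulates across requests/imports.
our %cache;

sub process_product {
    my ($product_ref) = @_;

    # Lexical: freed when the sub returns, as long as nothing else
    # keeps a reference to it.
    my %scratch = ( code => $product_ref->{code} );

    $cache{ $product_ref->{code} } = 1;    # this, by contrast, grows forever
    return scalar keys %scratch;
}

1;
```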

I found a nice module called Devel::NYTProf. It does not profile memory, so it may not be ideal for our case; however, it produces a detailed and expressive analysis of code execution times, with graphical output that is very intuitive and rich.
It is easy to install and run without changing any code in your program. I would suggest you run it.
https://metacpan.org/pod/Devel::NYTProf
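
For reference, a typical invocation (no source changes needed) looks like:

```sh
# profile a run; Devel::NYTProf writes ./nytprof.out
perl -d:NYTProf ./import_csv_file.pl [options...]
nytprofhtml    # renders the HTML report into ./nytprof/
```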

The output revealed a couple of things that might be worth looking further at:

  1. ProductOpener::Display::remove_tags_and_quote is called a lot. It may be worth cleaning the CSV first with one script, saving it as a clean output file, and then importing that with a second script.
  2. The initialization of some ProductOpener modules took about 1 second in a single call. This may be a one-time cost, related to the extra initialization code in Tags.pm. I would remove as much needless initialization code outside of functions, in global scope, as possible; that code is more likely to cause memory problems.

Here is a screenshot of just some of the report when I ran the script on 100 products that had already been imported:
[image: Devel::NYTProf report excerpt]

Here is a tree map of the time spent in each function. It is very informative but may not be helpful for our problems:
[image: NYTProf tree map of per-function time]


zigouras commented Nov 10, 2019

You may also want to make sure all of your CPAN modules are updated to the latest versions, as older versions may have bugs that contribute to memory leaks.

Make sure you don't have any circular references in your code. Perl cannot garbage-collect them:
https://perlmaven.com/eliminate-circular-reference-memory-leak-using-weaken
There is a module to help you find them:
https://metacpan.org/pod/Devel::Cycle
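
A minimal sketch of both, on a made-up parent/child structure:

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);
use Devel::Cycle;    # exports find_cycle()

my $parent = { name => 'parent' };
my $child  = { name => 'child', parent => $parent };
$parent->{child} = $child;    # reference cycle: refcounts never reach zero

find_cycle($parent);          # prints the $parent -> $child -> $parent loop

# Break the cycle: the back-reference no longer counts toward
# $parent's refcount, so both hashes can be freed normally.
weaken( $child->{parent} );
```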

A good page with some references to Perl best practices for memory management:
https://metacpan.org/pod/MemoryProfiling

I am also looking at Devel::MAT, which seems like the most robust memory profiler suite for Perl. Here is a good tutorial if you want to get started:
http://perladvent.org/2017/2017-12-12.html
Note that you don't have to use a die(); command to get the dump, you can capture it on a signal interrupt (Ctrl-\) or at the end of the script, see https://metacpan.org/pod/Devel::MAT::Dumper
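
E.g. something along these lines (option names per the Devel::MAT::Dumper documentation; worth double-checking against the installed version):

```sh
# write a .pmat heap dump when the script exits normally
perl -MDevel::MAT::Dumper=-dump_at_END ./import_csv_file.pl [options...]

# or dump on SIGQUIT (Ctrl-\) without stopping the process
perl -MDevel::MAT::Dumper=-dump_at_SIGQUIT ./import_csv_file.pl [options...]
```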

You can also use Memory::Stats to measure memory at checkpoints you specify in your code. You could see how the memory changes before and after certain blocks of code:
https://metacpan.org/pod/Memory::Stats
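
A sketch of that checkpoint pattern (the product loop is schematic, and import_product() is a hypothetical stand-in for the code block being measured):

```perl
use strict;
use warnings;
use Memory::Stats;

my @products = map { { code => 7000000000000 + $_ } } 0 .. 9;    # stand-in data

my $stats = Memory::Stats->new;
$stats->start;

for my $product_ref (@products) {
    # import_product($product_ref);    # hypothetical: the code being measured
    $stats->checkpoint( "after product " . $product_ref->{code} );
}

$stats->stop;
$stats->report;    # prints the memory consumed between each checkpoint
```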

stephanegigandet commented:

Hello @zigouras ! Thank you very much for taking the time to do this analysis!

The reports from Devel::NYTProf are very interesting; knowing which function is called the most, and from where, is going to be very useful. I'll definitely investigate what we can do for ProductOpener::Display::remove_tags_and_quote.

I read about circular references in the past, but I wasn't able to identify where I might be creating some. I don't think I create them (at least not intentionally).


hangy commented Nov 10, 2019

Generally, according to Devel::MAT, the largest objects (after importing 100 products) are still synonyms (with some 20 MiB each), which should be pretty constant.

I also tried dumping Devel::MAT output after every import iteration and ran pmat-leakreport on three files (1st, 50th, and 100th iteration). I'm not sure whether any of the potential leaks actually are leaks, but it would not make much sense to me to have that many MongoDB instances still alive after 100 iterations ...


zigouras commented Nov 10, 2019

I ran the latest import csv script on all 1000 items and it only used 0.5 GB of memory and was not increasing on my laptop.
I also used Devel::MAT and the pmat program to analyze the memory dump after 900 records were inserted. I didn't see any suspicious objects; the largest SV was the strtab at about 30 MB. The rest of the objects were distributed pretty evenly in terms of size.

My Devel::MAT analysis:

pmat [more]> count
  Kind         Count (blessed)        Bytes (blessed)
  ARRAY       225525        40     22.2 MiB   3.5 KiB
  CODE         17291        10      2.1 MiB   1.2 KiB
  GLOB         31189         2      4.5 MiB 304 bytes
  HASH        321132      4435    200.6 MiB   1.3 MiB
  INVLIST        490                1.1 MiB          
  IO              35        35      5.5 KiB   5.5 KiB
  PAD          14600                4.8 MiB          
  REF         547867               12.5 MiB          
  REGEXP        3445       201    753.6 KiB  44.0 KiB
  SCALAR     3759076        17    198.2 MiB   1.2 KiB
  STASH         1780                1.5 MiB          
  -------    ------- ---------    --------- ---------
  (total)    4922430      4740    448.3 MiB   1.4 MiB

Thanks @hangy for the analysis. We may want to review how we use MongoDB from Perl. Database connections are always a likely place for performance issues: do we use connection pooling?
I will keep looking.


zigouras commented Nov 10, 2019

It looks like certain Perl regular expressions can take up a lot of memory. If a regex contains $&, $` or $' it could be a big problem.
https://stackoverflow.com/questions/165660/why-is-my-perl-regex-using-so-much-memory

I am trying to get https://metacpan.org/pod/Devel::FindAmpersand installed, but I am getting errors. It doesn't seem to work on Perl later than 5.18.4 per the CPAN unit tests, so I am downloading that version and trying to build it. If anyone can get Devel::FindAmpersand working, it seems the best way to find the regex culprits in our code and all included modules.

Update: I built perl 5.18.4 and successfully installed Devel::FindAmpersand with it but couldn't get it to work with other project dependencies.
I used simple grep commands to find the ampersand regexes and found a lot in our ProductOpener library. See my report here. Refactoring these seems like a good option to try to reduce memory.
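
(For the record, the greps were presumably along these lines; the lib/ path is assumed:)

```sh
# rough search for the special match variables; also matches comments/strings
grep -rn '\$&' lib/ProductOpener/
grep -rn '\$`' lib/ProductOpener/
grep -rn "\\\$'" lib/ProductOpener/
```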
Update: It looks like Devel::NYTProf reports on these regexes. I will have to inspect its output again; I wasn't looking for that the first time.


hangy commented Nov 11, 2019

> We may want to review how we use MongoDB in Perl. Database connections are always a likely place for performance issues. Do we use connection pooling?

get_mongodb_client() holds a single $client reference. As per https://metacpan.org/pod/MongoDB::MongoClient, the actual socket is opened lazily and stays open until the client goes out of scope. We don't really clean up the $client (which I think is there so the web app can reuse the client?), but there shouldn't be any additional instances.
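
I.e. a cached-client pattern along these lines (a sketch of the pattern described, not the actual ProductOpener code):

```perl
use strict;
use warnings;
use MongoDB;

my $client;    # file-scoped cache: one client per process

sub get_mongodb_client {
    # MongoDB::MongoClient connects lazily, so constructing it is cheap;
    # caching one instance avoids creating a new connection per product.
    $client //= MongoDB::MongoClient->new(
        host => 'mongodb://localhost:27017',    # assumed connection string
    );
    return $client;
}
```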

> My Devel::MAT analysis shows that regexes use the most memory:

Really? It looks constant to me. After 50 products:

  Kind         Count (blessed)        Bytes (blessed)
  ARRAY       224130        39     20.7 MiB   3.4 KiB
  CODE         18558        10      2.3 MiB   1.2 KiB
  GLOB         33159         2      4.8 MiB 304 bytes
  HASH        230322      4451    129.2 MiB   1.3 MiB
  INVLIST        450              981.2 KiB
  IO              37        37      5.8 KiB   5.8 KiB
  PAD          15367                5.0 MiB
  REF         456683               10.5 MiB
  REGEXP        3491       196    763.7 KiB  42.9 KiB
  SCALAR     2206545        17    116.7 MiB   1.2 KiB
  STASH         1827                1.6 MiB
  -------    ------- ---------    --------- ---------
  (total)    3190569      4752    292.5 MiB   1.4 MiB

After 100 products:

pmat> count
  Kind         Count (blessed)        Bytes (blessed)
  ARRAY       224130        39     20.7 MiB   3.4 KiB
  CODE         18558        10      2.3 MiB   1.2 KiB
  GLOB         33159         2      4.8 MiB 304 bytes
  HASH        230322      4451    129.3 MiB   1.3 MiB
  INVLIST        450              981.2 KiB
  IO              37        37      5.8 KiB   5.8 KiB
  PAD          15367                5.0 MiB
  REF         456683               10.5 MiB
  REGEXP        3491       196    763.7 KiB  42.9 KiB
  SCALAR     2210165        17    116.9 MiB   1.2 KiB
  STASH         1827                1.6 MiB
  -------    ------- ---------    --------- ---------
  (total)    3194189      4752    292.7 MiB   1.4 MiB

Maybe I did something wrong (too small a sample size might be a problem), but I only see a ~0.2 MiB increase in SCALAR.


zigouras commented Nov 11, 2019

@hangy Sorry, I read the pmat count report wrong: I thought REGEXP was listed in MiB but it is KiB. I do not see it growing over your run.
I still think it is worth investigating the problematic regexes as described above. I am not completely sure how to interpret the output of pmat. It may be that the problematic regexes are storing their look-behind data as scalars, which is why that is the biggest type.

stephanegigandet added a commit that referenced this issue Nov 11, 2019
remove useless calls to remove_tags_and_quote(), bug #2574 #2563
stephanegigandet commented:

Here is another file to import with 100 real products, with full ingredients lists etc. They trigger much more processing involving ingredients analysis, taxonomies etc.

import-real-products.100.csv.zip

stephanegigandet commented:

Something I forgot to mention: if you run the import multiple times, the import script should detect that the products did not change, and there will be less processing (and the products won't be saved again on disk or in MongoDB).

For the import script, the only MongoDB use should be when adding/updating each product.

zigouras commented:

Thanks for the update @stephanegigandet. I did my memory analysis on a run that was adding 1000 new test products to the system; I deleted all the products on disk and cleared the MongoDB collection before my run. I will try with your more extensive data set.
I think this is the most promising clue at this point: Regex memory issues

stephanegigandet commented:

@zigouras Indeed, as you found, I've been using $' and $` extensively; I wasn't aware that they caused memory issues. We can certainly try replacing them to see if it affects the memory usage, using variations of this:

"In general, apply /^(.*?)(pattern)(.*)$/s and use $1 for $`, $2 for $& and $+ for $'"


hangy commented Nov 12, 2019

> I think this is the most promising clue at this point: Regex memory issues

I agree that this should be tested. However, it's probably not as bad as the 2008 Stack Overflow answer makes it sound: according to perlvar, the gravest problems with those variables were supposedly fixed in 5.18 (2013) and 5.20 (2014):

> In Perl 5.18.0 onwards, perl started noting the presence of each of the three variables separately, and only copied that part of the string required

> In Perl 5.20.0 a new copy-on-write system was enabled by default, which finally fixes all performance issues with these three variables, and makes them safe to use anywhere.

stephanegigandet commented:

Performance-wise (though maybe not memory-wise), it would probably be useful to see if we can optimize get_string_id_for_lang(), which is by far the most-called function.

E.g. in one of my import test runs:

Calls      P   F   ExclusiveTime   InclusiveTime   Subroutine
5889063    35  8   143s            174s            ProductOpener::Store::get_string_id_for_lang
7453       1   1   52.6s           260s            ProductOpener::Tags::spellcheck_taxonomy_tag
13140748   63  1   45.7s           45.7s           ProductOpener::Ingredients::CORE:regcomp (opcode)
99         1   1   42.8s           465s            ProductOpener::Ingredients::extract_ingredients_classes_from_text
71301254   30  1   31.5s           31.5s           ProductOpener::Store::CORE:subst (opcode)

We could try to merge most of the =~ s/// calls, and turn as many as possible into tr/// calls instead, which seem to be less costly.
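
For instance, chained single-character substitutions can collapse into one tr/// pass (a made-up example of the pattern, not actual ProductOpener code):

```perl
use strict;
use warnings;

my $id = "Test product, 2.5% fat";

# before: three passes over the string, one regex substitution each
# $id =~ s/ /-/g;
# $id =~ s/,/-/g;
# $id =~ s/\./-/g;

# after: one tr/// pass handles all three characters, no regex engine involved
$id =~ tr/ ,./---/;
print "$id\n";    # Test-product--2-5%-fat
```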


zigouras commented Nov 13, 2019

Sounds good, guys. Note that Devel::NYTProf reports on the problematic regexes. @hangy makes a good point that newer versions of Perl fixed the issues with these regex variables. However, if we still have memory issues, I think it would be worth refactoring these regexes out to see if it improves memory.

Question: are the dev and production Perl builds compiled with thread support or not?

@VaiTon VaiTon changed the title Big memory leak when adding / updating / importing products - Needs investigation Big memory leak when adding / updating / importing products Nov 23, 2019

github-actions bot commented Jul 5, 2020

Stale issue message

@teolemon teolemon added the 🏭 Producers Platform https://wiki.openfoodfacts.org/Platform_for_producers label Dec 23, 2020
@manon-corneille manon-corneille added the Fixed ? This issue might have been fixed. We can close them once we verify that they have. label Jun 2, 2023

manon-corneille commented Dec 21, 2023

Should we close this issue?

alexgarel commented:

This issue should be tested to see if we still have the bug.

See bug description.
