
Memory usage and tail calls #534

Open
yorickhardy opened this issue Mar 25, 2024 · 16 comments

@yorickhardy
Contributor

Hello,

I am trying out (srfi 18) in cyclone and noticed that memory was being "consumed" very quickly by a simple monitoring thread. The following small example seems to exhibit this behaviour:

(define (busy x) (busy (+ x 1)))
(busy 0)

Similarly, the cyclone compiler can be encouraged to use excessive memory (and compile time) by:

(define (compile-forever x) x (compile-forever x))
(compile-forever 0)

Both examples work as expected in icyc (with regard to memory use).

Am I doing something unreasonable here? I am using cyclone-0.36.0 on NetBSD 10.99.10 amd64 (current-ish).

@justinethier
Owner

Hi @yorickhardy. First of all, thanks for the report! These both seem like real issues, let me find some time this week to take a closer look.

@justinethier
Owner

Regarding:

(define (busy x) (busy (+ x 1)))
(busy 0)

Suspect this is related to the amount of memory added to a thread after major GC. Need to look into this more.

Regarding the compile-forever example, there is a problem in the CPS optimization beta expansion phase where it fails to detect a recursive call and loops forever. It looks like the problem (or at least part of it) is that analyze:find-recursive-calls does not properly handle edge cases where an ast:lambda call is made, or later on when the optimizer changes such calls to use Cyc-seq, e.g.:

(analyze:find-recursive-calls                                                    
  scan                                                                           
  app                                                                            
  (Cyc-seq x$1$2 (compile-forever k$5 x$1$2))) 

justinethier self-assigned this Apr 2, 2024
justinethier added a commit that referenced this issue Apr 3, 2024
Perform full scanning of function application list to ensure self-recursive calls are found. This prevents infinite loops in the beta expansion code when compiling simple recursive calls.
@yorickhardy
Contributor Author

Thanks!

I am sorry that I have not contributed any fixes; I will (eventually) try to put more time into understanding cyclone and help where/if I can.

@justinethier
Owner

No worries, and bug reports are always appreciated! I would welcome fixes as well, however, these two issues in particular are in areas that would be difficult to track down... and I still need to investigate the first one :)

@justinethier
Owner

As this runs, + will eventually overflow the fixnum range and convert the result to a bignum. I suspect that cyclone is allocating memory for bignums inefficiently in this use case, and that is the cause of the high memory usage.

Consider the memory usage when using fixnums or doubles:

(import (srfi 143))
(define (busy x) (busy (fx+ x 1)))
(busy 0)

(define (busy x) (busy (+ x 1.0)))
(busy 0.0)
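
As a quick way to see where the fixnum/bignum switch happens, a doubling probe along these lines (an untested sketch, reusing fixnum? from (srfi 143) as above) stops at the first value that is no longer a fixnum:

(import (scheme base) (scheme write) (srfi 143))

;; Double until the result is no longer a fixnum, then print it.
(define (find-overflow x)
  (if (fixnum? x)
      (find-overflow (* x 2))
      (begin (display "first bignum reached: ") (write x) (newline))))

(find-overflow 1)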

That said, I suspect we can do better here, especially since the interpreters can. Will need to spend time looking into this further.

@justinethier
Owner

Snippet of code from Cyc_sum:

    } else if (tx == bignum_tag && ty == -1) { \                                 
        BIGNUM_CALL(mp_init(&bn_tmp2)); \                                        
        Cyc_int2bignum(obj_obj2int(y), &bn_tmp2); \                               
        BIGNUM_CALL(BN_OP(&(x->bignum_t.bn), &bn_tmp2, &(x->bignum_t.bn))); \    
        mp_clear(&bn_tmp2); \                                                    

Compare with code from Cyc_fast_sum used in compiled code:

  if (is_object_type(x) && type_of(x) == bignum_tag) {                           
    if (obj_is_int(y)) {                                                         
      mp_int bny;                                                                
      BIGNUM_CALL(mp_init(&bny));                                                
      Cyc_int2bignum(obj_obj2int(y), &bny);                                      
      alloc_bignum(data, bn);                                                    
      BIGNUM_CALL(mp_add(&bignum_value(x), &bny, &bignum_value(bn)));            
      mp_clear(&bny);                                                            
      return bn;

Questions: Why are we doing an allocation here but not above, and can we safely speed up / optimize the latter code?
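
One consideration for the second question (speculation about the general case, not a claim about these specific call sites): an in-place update is only safe when the target bignum is known not to be shared, because a caller could otherwise observe its argument changing. A tiny (untested) illustration:

(import (scheme base) (scheme write))

(define big (expt 2 100))   ; large enough to be represented as a bignum
(define sum (+ big 1))      ; must not modify the bignum bound to big
(write big) (newline)       ; should still print 2^100, so + needs a fresh result here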

@yorickhardy
Contributor Author

I did not realize that it had switched over to bignums! Thanks. I have more or less isolated the memory usage problems that I was encountering. (Firstly, I was using many short-lived threads instead of a thread pool and assumed that thread-join! would (eventually) garbage collect the thread; that assumption is false as far as I can tell.) Here is another example with bounded(?) but large memory use:

;  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
; 9949 yorick    25    0   556M  425M CPU/1       0:17 97.91% 56.10% test
(define (loop-test)
  (let ( (o (open-output-string)) )
    (display "abc" o)
    (close-output-port o)
    (loop-test) ) )

(loop-test)

and similarly with constantly growing memory use:

;  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
; 3563 yorick    26    0    13G   10G CPU/0       0:49   164%   128% test

(define output-port (make-parameter #f))

(define (loop-test)
  (parameterize ( (output-port (open-output-string)) )
    (display "abc" (output-port))
    (close-output-port (output-port))
    (loop-test) ) )

(loop-test)

which surprised me! This simple example grows a bit slower but in the same way:

(define a-string (make-parameter #f))

(define (loop-test)
 (parameterize ( (a-string "123") )
  (loop-test) ) )

(loop-test)

Strangely, icyc manages fine. I guess the use of parameterize here is not great, since one can easily rewrite this example to avoid the nested parameterizations.
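
For example, something along these lines (a rough, untested sketch) keeps the recursive call outside the parameterize body, so the dynamic extents no longer nest:

(import (scheme base))

(define a-string (make-parameter #f))

(define (loop-test)
  (parameterize ( (a-string "123") )
    #f)            ; do the parameterized work here
  (loop-test))     ; recur outside the dynamic extent

(loop-test)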

I am not sure whether these examples are helpful, or just examples of poorly written Scheme!

At this point I am not convinced whether the remaining part is a valid issue or just poor programming on my part, so please close the issue if you are satisfied.

@yorickhardy
Contributor Author

Hello again,

The memory consumption that motivated this issue is all due to the use of many threads, via (srfi 18) and Cyc_scm_call. I have switched to thread pools and Cyc_scm_call_no_gc, and the memory consumption is now normal.
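
(Roughly, the pool pattern I am using looks like the sketch below. It is simplified and untested, with made-up names like submit-job! and worker-loop, and it uses only SRFI 18 mutexes with a polling worker loop; the point is just that the worker threads are long-lived.)

(import (scheme base) (srfi 18))

(define jobs '())                  ; pending thunks
(define jobs-mutex (make-mutex))

(define (submit-job! thunk)
  (mutex-lock! jobs-mutex)
  (set! jobs (cons thunk jobs))
  (mutex-unlock! jobs-mutex))

(define (worker-loop)
  (mutex-lock! jobs-mutex)
  (cond
    ((null? jobs)
     (mutex-unlock! jobs-mutex)
     (thread-sleep! 0.01))         ; idle poll; a condition variable would be nicer
    (else
     (let ((job (car jobs)))
       (set! jobs (cdr jobs))
       (mutex-unlock! jobs-mutex)
       (job))))
  (worker-loop))

;; Start a fixed number of long-lived workers instead of one thread per task.
(define workers
  (map (lambda (i) (thread-start! (make-thread worker-loop)))
       '(1 2 3 4)))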

I was very surprised that terminated threads consume memory. Nevertheless, I don't think the issue is entirely valid as reported, since the underlying problem was threads (not GC and tail calls). Apologies if I have wasted too much of your time.

(I do appreciate that you have looked into the reported examples and that you are willing to investigate improvements.)

Thanks!

@justinethier
Owner

Glad you got it working @yorickhardy!

I appreciate your feedback and think there are genuine issues that are being raised here, though I have not dug into your latest loop-test examples. Let me spend some time looking into everything you brought up to see what can be improved.

@justinethier
Owner

@yorickhardy Do you have an example of thread-join! not freeing up memory?

@yorickhardy
Contributor Author

Sure! I hope I have not missed anything obvious ...

(import (scheme base) (scheme write) (srfi 18))

(define (monitor n)
  (thread-yield!)
  n)

(define (wait-for-monitor next)
  (let ( (t (make-thread (lambda () (monitor next)))) )
    (thread-start! t)
    (thread-yield!)
    (thread-join! t)
    (wait-for-monitor (+ next 1.0)) ) )

(wait-for-monitor 0.0)

@yorickhardy
Contributor Author

Hello again,

I am not sure if this is the whole story, but after adding a bit of debug output it seems that the memory allocated in %alloc-thread-data is never freed (in the example "monitor" program above). I am also not sure when Cyc-end-thread! is called.

I am not yet able to make a more meaningful contribution, but I am trying to work towards it!

@justinethier
Owner

Hey @yorickhardy, appreciate the update!

That's interesting... we do malloc a thread data object in %alloc-thread-data. What is supposed to happen is that when the thread ends we call gc_remove_mutator to place that data on a list of old mutators (application threads). That list is eventually freed by gc_free_old_thread_data(), which is called on the main thread after a major GC trace is complete.

I wonder, could it be that major GC is not being triggered by the example program?

@yorickhardy
Contributor Author

Yes, that seems to be the beginning of the issue. Am I correct in saying that the collector only starts (sometimes) when allocating Scheme objects? In my debugging output, gc_free_old_thread_data() was never called, because the collector remained in the STAGE_RESTING state.

I hacked together a workaround to force the collector out of the STAGE_RESTING state, and then the free happens (the workaround is very ugly; I will try to think of a better way to do this). However, memory usage still increases. I still need to check whether the vector with the thread information and result is ever freed; that is possibly the second reason for the increasing memory usage.

@justinethier
Owner

Correct, the collector will only trigger a major GC when allocating objects and the runtime detects a need to start the collector, for example when a certain percentage of available memory has been used.
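
One way to test that hypothesis with the monitor example (an untested sketch; churn and retained are made-up names, and the exact triggering heuristic is an assumption) is to retain some extra allocation between joins, so that heap usage grows and a major GC, and with it gc_free_old_thread_data(), eventually runs:

;; Keep allocations live so they survive minor GC and count toward heap
;; usage, then drop them so a later major GC can reclaim them.
(define retained '())

(define (churn n)
  (when (> n 0)
    (set! retained (cons (make-vector 256 0) retained))
    (churn (- n 1))))

(define (wait-for-monitor next)
  (let ( (t (make-thread (lambda () (monitor next)))) )
    (thread-start! t)
    (thread-yield!)
    (thread-join! t)
    (churn 100)                      ; extra live allocation between joins
    (when (> (length retained) 100000)
      (set! retained '()))           ; let the retained data become garbage again
    (wait-for-monitor (+ next 1.0)) ) )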

@yorickhardy
Contributor Author

Thanks.

In gc_thread_data_free, gc_merge_all_heaps adds the terminated thread's heap(s) to the primordial thread. Could this be the largest contribution to the increased memory usage?

I still need to track down how the heap allocations are eventually freed; then I can try forcing the memory to be freed, to see whether that improves memory use for the example program.
