Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ARC] SIGSEGV in arc.nim calling eqcopy for combination of ref objects, noInit, ... #16253

Open
Vindaar opened this issue Dec 4, 2020 · 3 comments

Comments

@Vindaar
Copy link
Contributor

Vindaar commented Dec 4, 2020

Another hard to isolate bug, that might be superficially related to #16185 .

Essentially I get a segmentation fault when larger tensors are involved and I try to convert them to a dataframe Column (a ref object), which uses asType internally because it's a variant object.

With the exact combination of

  • types
  • sizes of involved objects
  • {.noInit.} on asType
  • construction of result Column

I can reproduce it based on the example from #16185.

Sorry, that these are getting larger. :/

Example

import sequtils

type
  CpuStorage*[T] = ref CpuStorageObj[T]
  CpuStorageObj[T] = object
    size*: int
    raw_buffer*: ptr UncheckedArray[T]
  Tensor[T] = object
    buf*: CpuStorage[T]

proc `=destroy`[T](s: var CpuStorageObj[T]) =
  s.raw_buffer.deallocShared()
  s.size = 0
  s.raw_buffer = nil

proc `=`[T](a: var CpuStorageObj[T]; b: CpuStorageObj[T]) {.error.}

proc allocCpuStorage[T](s: var CpuStorage[T], size: int) =
  new(s)
  s.raw_buffer = cast[ptr UncheckedArray[T]](allocShared0(sizeof(T) * size))
  s.size = size

proc newTensor[T](size: int): Tensor[T] =
  allocCpuStorage(result.buf, size)

proc `[]`[T](t: Tensor[T], idx: int): T = t.buf.raw_buffer[idx]
proc `[]=`[T](t: Tensor[T], idx: int, val: T) = t.buf.raw_buffer[idx] = val
func size[T](t: Tensor[T]): int = t.buf.size

proc toTensor[T](s: seq[T]): Tensor[T] =
  result = newTensor[T](s.len)
  for i, x in s:
    result[i] = x

type
  Column* = ref object # works if normal object
    fCol: Tensor[float]

proc asType*(t: Tensor[float]): Tensor[float] {.noInit.} = # works if `noInit` removed
  result = t

proc toColumn*(t: Tensor[float]): Column =
  result = Column(fCol: t.asType())
  # works with regular contruction of ref object:
  #result = new Column
  #result.fCol = t.asType()

proc theBug =
  # replacing toSeq.mapIt by a for loop makes it go away
  # broken starting from `32252` on my machine, toSeq(0 .. 32251).mapIt(it.float).toTensor() works
  let occ = toSeq(0 .. 32252).mapIt(it.float).toTensor()
  let c = toColumn occ

theBug()

Current Output

Hint: /tmp/testbug  [Exec]
Traceback (most recent call last)
/tmp/testbug.nim(55)     testbug
/tmp/testbug.nim(53)     theBug
/tmp/testbug.nim(44)     toColumn
/tmp/testbug.nim(40)     asType
/home/basti/src/nim/nim_git_repo/lib/system/arc.nim(198) nimDecRefIsLast
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Error: execution of an external program failed: '/tmp/testbug '

Expected Output

None and no segmentation fault.

Additional Information

As mentioned in comments in the code sample. The bug is hard to reproduce, because changing one of multiple parameters makes it go away:

  • make Column a normal object and it works
  • explicitly use result = new Column syntax in toColumn and it works (I use that syntax, because Column normally is a variant object and t.kind = <enumVal> isn't allowed anymore)
  • replace toSeq.mapIt by for loop and it works
  • make toSeq upper range 1 smaller and it works
  • remove noInit on asType and it works

Not sure if it's any added information for someone familiar with ARC internals, but I was surprised to eqcopy show up: stack trace from gdb after compilation with --gc:arc --debugger:native:

#0  nimDecRefIsLast (p=0x204) at /home/basti/src/nim/nim_git_repo/lib/system/arc.nim:198
#1  eqcopy___ZTkggGiEJlkYjSbKwsz8GA (dest=0x7ffff7c1e048, src=...) at /tmp/testbug.nim:40
#2  0x000055555556306a in asType__GdnHu0zEwnuQjaw48Es9aSw (t=..., Result=0x7ffff7c1e048) at /tmp/testbug.nim:40
#3  0x000055555556314d in toColumn__tQPWDkYC78GFCnyDyUlY4A (t=...) at /tmp/testbug.nim:44
#4  0x0000555555563ce0 in theBug__mBIjvFhk1qUNwF6uZ16Msw () at /tmp/testbug.nim:55
#5  0x0000555555563e68 in NimMainModule () at /tmp/testbug.nim:57
#6  0x0000555555563eb1 in NimMainInner () at /home/basti/src/nim/nim_git_repo/lib/system.nim:2167
#7  0x0000555555563db6 in NimMain () at /home/basti/src/nim/nim_git_repo/lib/system.nim:2175
#8  0x0000555555563dd8 in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>) at /home/basti/src/nim/nim_git_repo/lib/system.nim:2182

I suppose eqcopy makes sense, because it cannot be a sink for occ might be used in a later context.

Nim Compiler Version 1.5.1 [Linux: amd64]
Compiled at 2020-12-04
Copyright (c) 2006-2020 by Andreas Rumpf

git hash: edce5897a52aac308fbe9b14f155139f15da02cd
active boot switches: -d:release -d:danger
@Yardanico
Copy link
Collaborator

Yardanico commented Apr 19, 2022

I don't really know what is causing this, but the error seems to come from this:

  • In toColumn we have a call to asType which looks like this:
	T1_ = (tyObject_ColumncolonObjectType___pD6SWjFNwFAKrxUOQIaAeQ*) nimNewObjUninit(sizeof(tyObject_ColumncolonObjectType___pD6SWjFNwFAKrxUOQIaAeQ), NIM_ALIGNOF(tyObject_ColumncolonObjectType___pD6SWjFNwFAKrxUOQIaAeQ));
	asType__tata_35(t, (&(*T1_).fCol));
  • asType just calls eqcopy - eqcopy___tata_43(Result, t);
  • eqcopy calls nimDecRefIsLast on the dest argument, which is the T1_ from above - nimDecRefIsLast((*dest).buf);
  • If the pointer isn't nil, nimDecRefIsLast dereferences the pointer (p-8 specifically), which is not valid in this case, so the application SIGSEGVs.

Also probably important to mention that asType here is NRVO'd, so it has the signature N_LIB_PRIVATE N_NIMCALL(void, asType__tata_35)(tyObject_Tensor__se8ci6pBfGB8ZjC9aW2QbHQ t, tyObject_Tensor__se8ci6pBfGB8ZjC9aW2QbHQ* Result)

That's basically how it crashes. If you compile with -d:useMalloc it'll likely not crash, but that's probably just due to malloc on my machine zeroing memory automatically. As far as I understand, Nim shouldn't call nimNewObjUninit and instead use normal nimNewObj here, because asType calls ecopy.

One thing that remains a mystery for me is why it starts doing that at 32252 specifically, but probably this has something to do with memory allocations.

@Araq
Copy link
Member

Araq commented Oct 3, 2022

While the real solution here is still not entirely clear this code here:

proc asType*(t: Tensor[float]): Tensor[float] {.noInit.} = # works if `noInit` removed
  result = t

is simply bad. The code generator computes noInit on its own since at least Nim version 1.2 and the noInit can be left out here.

@Vindaar
Copy link
Contributor Author

Vindaar commented Oct 4, 2022

While the real solution here is still not entirely clear this code here:

proc asType*(t: Tensor[float]): Tensor[float] {.noInit.} = # works if `noInit` removed
  result = t

is simply bad. The code generator computes noInit on its own since at least Nim version 1.2 and the noInit can be left out here.

Fair enough. To be honest I wasn't aware that the compiler computes noInit automatically. That's great. On the other hand this is still in arraymancer for support of 1.0 (which arguably can be dropped nowadays I guess).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants