1.7 should include this memory leak fix #2969

Closed
mrjbq7 opened this Issue Feb 7, 2013 · 21 comments

Projects

None yet

7 participants

@mrjbq7
mrjbq7 commented Feb 7, 2013

While investigating a memory leak in scikit-learn (scikit-learn/scikit-learn#1663), I discovered that the memory leak was caused by a bug in numpy.

The fix to the memory leak was committed to numpy in September 2012 (80b3a34), but isn't in the maintenance/1.7.x branch.

Are there plans to backport it to 1.7 and make a new release?

Thanks,

@mrjbq7 mrjbq7 referenced this issue in scikit-learn/scikit-learn Feb 7, 2013
Closed

Memory leak in LocallyLinearEmbedding #1663

@charris
Member
charris commented Feb 7, 2013

At this point I think it should go into 1.7.1, which should probably come out in two or three months.

@mrjbq7
mrjbq7 commented Feb 7, 2013

Thats too bad - the fix seems innocuous and well tested, no?

@charris
Member
charris commented Feb 7, 2013

Poor timing. If you had prodded our memories a month ago or responded to Ondrej's posts about what needed to go into the release it would have helped, but I know it is hard to keep an eye on everything, us too. Various needed fixes and newly discovered bugs are going to be in every release. The best cure is to release more often, for if we wait for perfection nothing will ever get released. You might want to post on the list and mention this as something that needs to be in 1.7.1.

@charris
Member
charris commented Feb 7, 2013

@certik We should probably start keeping a list of things for a 1.7.1 fixup release.

@mrjbq7
mrjbq7 commented Feb 7, 2013

I hear you - wish I'd run into it earlier also. Although, to be fair the bug was discovered and fixed 5 months ago, and will likely be re-discovered many more times before numpy gets around to including it in a release that users upgrade to.

Perhaps the solution is to be more conscientious about back-porting critical fixes to maintenance branches.

The fact that many users of numpy are using 1.7.0b2 is kind of a sign that releases are too few and too far between. If 1.7.0 isn't even out yet, but many users are using an old beta version of it, that's a sign of something broken.

Thanks for being responsive.

@nouiz
nouiz commented Feb 7, 2013

@charris do you try to make the last rc the same as the final? I think that this is what you try to do. I'm not sure it is a good goal, but totally agree that there should be only few and small/safe change between them.

If the idea is for them to be identical, I agree, we shouldn't do the backport, but I think we should target the other and I think this commit fit that cases: a safe(tested for 5 mounts) and small bug fix.

Should we bring this organization question to the mailing list? Anyway, I trust you on this @charris. The mailing list will probably take too much time for this release anyway.

@charris
Member
charris commented Feb 7, 2013

@nouiz Because the release is about 9 months late and we need to get it out.

@rgommers
Member
rgommers commented Feb 7, 2013

@nouiz in the end, the decision to include a given patch or not is up to Ondrej as release manager.

I've always attempted to keep final identical to the last RC.

@nouiz
nouiz commented Feb 7, 2013

ok i didn't know it was a constraint you tried to enforce.

@charris
Member
charris commented Feb 7, 2013

Yes, it is possible to end up introducing as many bugs as fixes. Innocent looking changes can hide little time bombs. The best way to avoid those problems is make sure everything gets tested, which means a release candidate. The release itself ends up being the biggest test and that is why shorter release cycles will help.

@erg
erg commented Feb 7, 2013

+1 for including this patch in the release.

For the LLE algorithm run in a loop in scikit-learn, my max memory usage is 350mb with this patch applied. Without this patch, it grows by 3gb per loop iteration.

@njsmith
Member
njsmith commented Feb 7, 2013

It's just too late to put new non-RC stuff into 1.7.0, unless it turns out we're making some horrible back-compat break or something, 1.7.0 is done, and we've learned a lot about how not to do this kind of release cycle (esp. the 5 months thing), which we will try to improve upon...

@charris: Re: making a list: I made a github milestone for 1.7.1, that we can use to tag backport-bugs. And I still think we should just release 1.8.0 within the next few months, we already have plenty of changes stacked up, including this one.

@mrjbq7, @nouiz, @erg: We do hear you. It depends on @certik, who might want a well-earned break, but in principle there's no reason that we couldn't release 1.7.1 very shortly after 1.7 (within a few weeks?), if people are willing to do the work of finding the clearly backport-able bug fixes and backporting them. If this bug is affecting you badly, then two options are two just use master as a temporary measure, or to help sort through master's changelog to identify other fixes that need backporting.

@mrjbq7
mrjbq7 commented Feb 7, 2013

@njsmith Okay, I understand the need to release without additional delays. I am able to workaround it - once I figured out what the problem was, that part is easy. Mostly, I was thinking of other users who want to use features in scikit-learn that are broken by this bug, and might not take the effort to figure out why / or to workaround it like I did.

I don't know what your process is, but if you need volunteers to help identify or backport fixes, I wouldn't mind helping.

Thanks,

@njsmith
Member
njsmith commented Feb 7, 2013

numpy-the-project has been a bit disorganized and short on personpower in
recent years, so we're not sure what our process is either - we're trying
to get more organized and sort these things out now :-).

The work needed to make a point release is for people to:

  1. Look through the changes made to master since maintenance/1.7.x was
    branched off of it, and identify which ones are simple bug fixes and have
    not already been backported.
  2. Backport those changes to the maintenance/1.7.x branch (with suitable
    release notes added)

So I guess if you're up for helping, then you could do (1), and for each
such change you discover file a bug saying "this/these commits need
backporting!" (and tag it with the 1.7 backport milestone, if you're
allowed to do that... not sure what github's rules for setting milestones
are). And if you're looking to be extra super helpful, you could go ahead
and do the backports, and submit them as PRs against maintenanace/1.7.x
(again please to careful to highlight that that's what you're doing in the
PR description). We'll wait until after 1.7.0-final is released before we
actually apply any such PRs, but that should be within the week or so.

Some useful commands:

show all commits in master since 1.7.x was branched off:

git log origin/maintenance/1.7.x..origin/master

show all commits in 1.7.x since it branched off master:

git log origin/master..origin/maintenance/1.7.x

Thanks!

On Thu, Feb 7, 2013 at 2:17 PM, John Benediktsson
notifications@github.comwrote:

@njsmith https://github.com/njsmith Okay, I understand the need to
release without additional delays. I am able to workaround it - once I
figured out what the problem was, that part is easy. Mostly, I was thinking
of other users who want to use features in scikit-learn that are broken by
this bug, and might not take the effort to figure out why / or to
workaround it like I did.

I don't know what your process is, but if you need volunteers to help
identify or backport fixes, I wouldn't mind helping.

Thanks,


Reply to this email directly or view it on GitHubhttps://github.com/numpy/numpy/issues/2969#issuecomment-13264433.

@certik
Collaborator
certik commented Feb 8, 2013

It would be nice to discuss this on the mailinglist to get more feedback.

I looked at the fix and it does look innocent to me. But I am not going to include this in 1.7.0, because what is even more important is to finally release and make sure we don't screw up.

I have no problem to release 1.7.1 soon, I have all the scripts ready. I agree that taking 0.5 year is unacceptable.
The reason it took so long is that I was graduating my Ph.D. in the last 0.5 year, e.g. writing a thesis, defending, finishing up two articles... and applying for jobs/flying for interviews, also my son was just born, so I've been really busy. And also some of the bugs were quite nasty, and it took me a few days of full time work to nail them down and fix them. However, this is now done and I've also become more experienced with everything, so the next time will be faster.

Anyway, my apologies for that. As @njsmith said, if you would like to help us out @mrjbq7, we would highly appreciate it. You can help with sending PRs, etc. How exactly do you use NumPy? Do you install the release tarball using setup.py? (If so, then you can easily just use the master version for now.) Or some other way.

@charris
Member
charris commented Feb 8, 2013

@certik Ondrej, I certainly didn't mean to imply that it was "all your fault". There were a number of events that came together to leave things in a disorganized state, and I suspect you got dumped into the middle without being completely apprised of the situation. If you were informed, well, that's another matter ;-)

@mrjbq7
mrjbq7 commented Feb 8, 2013

@certik I certainly appreciate your and other contributors hard work! My job is just to champion for the bugs and encourage fixing when possible. I look forward to helping out with 1.7.1 if I can over the next few weeks.

Best,

@certik
Collaborator
certik commented Feb 8, 2013

@charris, I am now aware of any events. Anyway, this issue is scheduled for backport, so I'll do it once 1.7.0 is out in a few days.

@charris charris added a commit to charris/numpy that referenced this issue Feb 12, 2013
@charris charris BUG: gh-2969: Backport memory leak fix 80b3a34. 3819bf8
@charris
Member
charris commented Feb 12, 2013

I put up a PR to backport this.

@erg
erg commented Feb 26, 2013

How about a 1.7.1 release? Scikit users are wasting time on old bugs. See previous link...

@njsmith
Member
njsmith commented Mar 2, 2013

The fix is merged into both maintenance/1.7.x and master, we're just waiting for 1.7.1 now, so I'm closing this to get it off the todo-for-1.7.x list.

@njsmith njsmith closed this Mar 2, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment