Very slow traceback construction from Cython extension #1317

wesm opened this Issue Jan 24, 2012 · 18 comments


wesm commented Jan 24, 2012

I think this is the fault of IPython and I'm having a hard time coming up with a self-contained example, but basically I'm noticing a major lag between statement execution and IPython coming back with the formatted traceback when the exception is generated in Cython code. For example, with pandas git master (99e2eec, because I'm about to fix this bug), the following code takes about 8.5 seconds to play out:

from pandas import Series
s = Series({'T1': 5})
print s.groupby({'T1': 'T2'}).agg(sum)

Running this code

import time
start = time.time()
print s.groupby({'T1': 'T2'}).agg(sum)
time.time() - start

gives me

In [10]: time.time() - start
Out[10]: 8.473900079727173

By contrast, running the same code in the standard Python interpreter gives the stack trace virtually instantaneously.

Any ideas what's wrong? This same problem seems to be present in IPython git master, 0.12, and 0.11.

takluyver commented Jan 24, 2012

Installing pandas in my virtualenv seems to mess up my sys.path so that I get thrown back to IPython 0.11 (which I don't normally get with or without the virtualenv).

There's quite a lot of code to produce those tracebacks. If I switch to plain exception mode (using %xmode) they appear instantly, so it must be something going wrong in ultratb.
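(For reference: %xmode is the IPython magic that switches the traceback formatting between its Plain, Context, and Verbose modes; Plain bypasses most of ultratb's heavy formatting, which is why the traceback comes back instantly in that mode. A typical session looks like:)

```
In [1]: %xmode Plain
Exception reporting mode: Plain
```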

takluyver commented Jan 24, 2012

Installing pandas in my virtualenv adds these two lines to $VIRTUAL_ENV/lib/python2.7/site-packages/easy-install.pth, throwing the order of things on sys.path off:

/usr/lib/pymodules/python2.7
/usr/lib/python2.7/dist-packages

Do you want me to file a bug for this?

minrk commented Jan 24, 2012

What Cython/Python/etc. versions? I ran this code with 99e2eec and got TypeError: unhashable type 'dict' immediately.

OSX 10.7 System Python 2.7.1
cython/cython@3d13ae3
wesm/pandas@99e2eec

wesm commented Jan 24, 2012

@takluyver is that pandas-related? Must be something weird going wrong in setuptools.

@minrk I'll try a newer Python. This is Python 2.7.2 on 64-bit Ubuntu

takluyver commented Jan 24, 2012

@wesm : I suspected ez_setup, but commenting that out from setup.py didn't seem to make a difference. I wonder if it's also related to Cython.

I replicated the slow traceback with Python 2.7.2 on 32-bit Ubuntu.

wesm commented Jan 24, 2012

Just updated to Cython git master and it seems to be OK now. Unclear why the behavior was only manifesting in ultratb. Closing the issue (unless you guys want to investigate further).

@wesm wesm closed this Jan 24, 2012

minrk commented Jan 25, 2012

Thanks - just for reference, what Cython version were you using before you updated? It could be that they had some actions, triggered by inspection for better tracebacks, that require some work.

wesm commented Jan 25, 2012

Appears to be a problem with 262bc9f and others around that time frame. Could do a git bisect to see when it got better

wesm commented Jan 25, 2012

git bisect yielded:

0e579823bd34de5d1d9b4aeac2c8d727415cba2d is the first bad commit
commit 0e579823bd34de5d1d9b4aeac2c8d727415cba2d
Author: Mark Florisson <markflorisson88@gmail.com>
Date:   Sat Dec 24 23:01:25 2011 +0100

    Lower buffer_max_dims to 8

:040000 040000 5cef283c2e0a364154c99aa9876e118df137798f 4cb82498c330d2c45876bae61fc39447d0ff9941 M  Cython

I just verified that the bug is present in Cython 4b9a1ff and not in 0e57982. Maybe we should raise the issue with them so that they know it was a problem (in case it was fixed by accident)? It's strange because the change in 0e57982 is so tiny.

@wesm wesm reopened this Jan 29, 2012

wesm commented Jan 29, 2012

So I scripted IPython to get a profile of what's going on, and it's extremely perplexing:

11:58 ~/Dropbox/book/svn $ python -m cProfile -s cumulative foo.py 
         8265151 function calls (8261576 primitive calls) in 12.425 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.427   12.427 foo.py:1(<module>)
        3    0.000    0.000   11.816    3.939 shell.py:118(process_input_line)
        3    0.000    0.000   11.815    3.938 interactiveshell.py:2358(run_cell)
        3    0.000    0.000   11.813    3.938 interactiveshell.py:2447(run_ast_nodes)
        3    0.000    0.000   11.813    3.938 interactiveshell.py:2511(run_code)
        1    0.000    0.000   11.781   11.781 interactiveshell.py:1594(showtraceback)
        1    0.000    0.000   11.781   11.781 ultratb.py:1153(structured_traceback)
        1    0.001    0.001   11.781   11.781 ultratb.py:1061(structured_traceback)
        1    2.146    2.146   11.780   11.780 ultratb.py:674(structured_traceback)
  3097026    3.163    0.000    9.454    0.000 tokenize.py:264(generate_tokens)
  3097286    6.095    0.000    6.095    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
        1    0.000    0.000    0.400    0.400 shell.py:70(__init__)
        2    0.000    0.000    0.397    0.199 configurable.py:287(instance)
        1    0.000    0.000    0.396    0.396 interactiveshell.py:399(__init__)
        1    0.000    0.000    0.364    0.364 interactiveshell.py:1436(init_history)
        1    0.000    0.000    0.364    0.364 history.py:396(__init__)
        1    0.000    0.000    0.297    0.297 history.py:93(__init__)
        5    0.297    0.059    0.297    0.059 {method 'execute' of 'sqlite3.Connection' objects}
        1    0.000    0.000    0.297    0.297 history.py:151(init_db)
        1    0.003    0.003    0.211    0.211 shell.py:4(<module>)
        1    0.002    0.002    0.113    0.113 __init__.py:99(<module>)
   703160    0.112    0.000    0.112    0.000 {method 'span' of '_sre.SRE_Match' objects}
        1    0.001    0.001    0.103    0.103 rcsetup.py:14(<module>)
        3    0.001    0.000    0.094    0.031 __init__.py:6(<module>)

Hmm, the plot thickens. Something appears to be going very wrong inside generate_tokens. Anyone have clues? I feel like we ought to figure this one out.

wesm commented Jan 29, 2012

More: generate_tokens yields about 2.6 million tokens before raising an exception


12:18 ~/Dropbox/book/svn $ python -m pdb foo.py 
> /home/wesm/Dropbox/book/svn/foo.py(1)<module>()
-> from ipybook.shell import IPythonShell
(Pdb) b IPython/core/ultratb:838, 'tseries' in file
Breakpoint 1 at /home/wesm/code/ipython/IPython/core/ultratb.py:838
(Pdb) 
Breakpoint 2 at /home/wesm/code/ipython/IPython/core/ultratb.py:838
(Pdb) c
> /home/wesm/code/ipython/IPython/core/ultratb.py(838)structured_traceback()
-> name_cont = False
gen = generate_tokens(linereader)
gen.next()
(52, '\x00', (1, 0), (1, 1), '\x00\x00\x00\x00\x00 \t3\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\xc2\x03\x12\x00\x00\x00\x00\x00(\t3\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00P\xa7\n')
gen.next()
(52, '\x00', (1, 1), (1, 2), '\x00\x00\x00\x00\x00 \t3\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\xc2\x03\x12\x00\x00\x00\x00\x00(\t3\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00P\xa7\n')
gen.next()
(52, '\x00', (1, 2), (1, 3), '\x00\x00\x00\x00\x00 \t3\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\xc2\x03\x12\x00\x00\x00\x00\x00(\t3\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00P\xa7\n')
gen.next()
(52, '\x00', (1, 3), (1, 4), '\x00\x00\x00\x00\x00 \t3\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\xc2\x03\x12\x00\x00\x00\x00\x00(\t3\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00P\xa7\n')
gen.next()
(52, '\x00', (1, 4), (1, 5), '\x00\x00\x00\x00\x00 \t3\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\xc2\x03\x12\x00\x00\x00\x00\x00(\t3\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00P\xa7\n')
results = []
while True: results.append(gen.next())
*** TokenError: ('EOF in multi-line statement', (38697, 0))
len(results)
2580658
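The pathology above can be reproduced outside IPython: tokenize is lazy, so on input that never completes a statement it happily yields tokens for every line and only raises TokenError when it hits EOF. A minimal sketch (Python 3 syntax, not the original 2012 session; the null-byte file is replaced here by an unclosed bracket, which fails the same lazy way):

```python
import io
import tokenize

# An unclosed bracket means the tokenizer never sees a complete statement:
# it keeps yielding tokens for every line and only errors out at EOF.
source = "x = [\n" + "1,\n" * 1000

tokens = []
try:
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        tokens.append(tok)
except (tokenize.TokenError, SyntaxError):
    # 'EOF in multi-line statement' (TokenError on most Python versions)
    pass

# Thousands of tokens were produced before the error ever surfaced.
print(len(tokens) > 1000)
```

On a multi-megabyte binary file the same effect scales up to the millions of tokens (and seconds of work) observed above.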

wesm commented Jan 29, 2012

OK, so I think this is a bug in IPython. No attempt should ever be made to tokenize a binary .so or .pyd file. After updating Cython, by some dumb luck, this now happens on the 48th iteration of the generate_tokens generator:


> /home/wesm/code/repos/ipython/IPython/core/ultratb.py(841)structured_traceback()
-> print >> sys.stderr, file, lnum

/home/wesm/code/pandas/pandas/_tseries.so 27
> /home/wesm/code/repos/ipython/IPython/core/ultratb.py(842)structured_traceback()
-> print >> sys.stderr, i
n
47
> /home/wesm/code/repos/ipython/IPython/core/ultratb.py(844)structured_traceback()
-> if token_type == tokenize.NAME and token not in keyword.kwlist:
l
839     
840                     for i, (token_type, token, start, end, line) in enumerate(generate_tokens(linereader)):
841                         print >> sys.stderr, file, lnum
842                         print >> sys.stderr, i
843                         # build composite names
844  ->                     if token_type == tokenize.NAME and token not in keyword.kwlist:
845                             if name_cont:
846                                 # Continuation of a dotted name
847                                 try:
848                                     names[-1].append(token)
849                                 except IndexError:
token_type
4
n
> /home/wesm/code/repos/ipython/IPython/core/ultratb.py(860)structured_traceback()
-> elif token == '.':
n
> /home/wesm/code/repos/ipython/IPython/core/ultratb.py(862)structured_traceback()
-> elif token_type == tokenize.NEWLINE:
l
857                                 # to do (or the caller can print a list with repeated
858                                 # names if so desired.
859                                 names.append([token])
860                         elif token == '.':
861                             name_cont = True
862  ->                     elif token_type == tokenize.NEWLINE:
863                             break
864     
865                 except (IndexError, UnicodeDecodeError):
866                     # signals exit of tokenizer
867                     pass
n
> /home/wesm/code/repos/ipython/IPython/core/ultratb.py(863)structured_traceback()
-> break
l
858                                 # names if so desired.
859                                 names.append([token])
860                         elif token == '.':
861                             name_cont = True
862                         elif token_type == tokenize.NEWLINE:
863  ->                         break
864     
865                 except (IndexError, UnicodeDecodeError):
866                     # signals exit of tokenizer
867                     pass
868                 except tokenize.TokenError,msg:
token_type
4
token
'\n'
start
(1, 47)
end
(1, 48)
line
'\x00\x00\x00\x00\x00`\x083\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\xc2\x03\x12\x00\x00\x00\x00\x00h\x083\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00P\xa7\n'

any proposed solutions?

takluyver commented Jan 29, 2012

We call tokenize when generating the tracebacks, both for the coloured source, and for the variable names on the relevant line for the 'context' traceback mode.

I'd guess it's trying to parse the compiled file, rather than the source, and falling down. But it doesn't raise an exception until it's done a load of parsing. I had a similar problem before, but it blew up instantly. We probably should figure out by some means that it's looking at a compiled file, and either look up the pyx file, or skip it altogether (I don't know if tokenize will work on Cython code, and of course it may not be present).

Relevant code is around here:
https://github.com/ipython/ipython/blob/master/IPython/core/ultratb.py#L827

takluyver commented Jan 29, 2012

Assigning this to myself, since I've touched ultraTB most recently.

@ghost ghost assigned takluyver Jan 29, 2012

wesm commented Jan 29, 2012

tokenize probably won't work with Cython code, because it's part of the Python standard library and only understands Python syntax. Any binary file should probably just be skipped.
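The fix that was eventually merged takes roughly this shape: check the filename extension before handing the file to the tokenizer. A minimal sketch (the helper name and constant are illustrative, not the actual ultratb code):

```python
import os

# Extensions for compiled extension modules; tokenizing these as Python
# source is pointless and, as seen above, can take seconds to fail.
BINARY_EXTENSIONS = ('.so', '.pyd', '.dll')

def should_tokenize(filename):
    """Return True only for files that plausibly contain Python source."""
    return os.path.splitext(filename)[1].lower() not in BINARY_EXTENSIONS

print(should_tokenize('pandas/_tseries.so'))      # compiled module: skip it
print(should_tokenize('pandas/core/series.py'))   # plain source: tokenize it
```

Matching case-insensitively covers oddly-cased filenames on Windows, where .pyd and .dll are the relevant extensions.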

takluyver commented Jan 29, 2012

Are there other valid extensions for binary files with Python functions, beyond .so and .pyd? Is .dll possible on Windows?

takluyver commented Jan 29, 2012

See PR #1341

@takluyver takluyver closed this in 9fbeb53 Jan 29, 2012

fperez added a commit that referenced this issue Jan 29, 2012

Merge pull request #1341 from takluyver/i1317
Don't attempt to tokenize binary files for tracebacks.

Previously we had been trying and just catching the exception, but in corner cases the tokenizer can run for several seconds before raising an exception (#1317). This skips tokenizing if the file has the extension .so, .pyd or .dll.

Closes gh-1317.

wesm commented Jan 29, 2012

Much obliged and thanks as always.
