Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of InputSplitter #10898

Merged
merged 5 commits into from Nov 15, 2017

Conversation

takluyver
Copy link
Member

This fixes some embarrassingly bad performance. @rpep pointed out to me that a 500-line cell he had in a notebook took tens of seconds even to start executing, e.g. to show a syntax error.

I tracked this down to our input processing in IPython.core.inputsplitter. We are doing large amounts of unnecessary work by calculating at each line processed:

  • The indentation for the next line, which involves tokenisation. This was >90% of the time taken.
  • Whether the input is complete, which involves byte-compilation.

We still need the ability to work out both of these things, but we don't need them for each line of a block passed to inputsplitter, and we don't need them at all when transforming code prior to execution.

This code avoids much of this unnecessary work, taking the processing time on the 500-line sample from ~40 seconds to ~0.2 seconds on my computer. The problematic code is roughly O(N^2) in the number of lines, so it's probably only noticeable with very long cells, but with a 40 line sample I measured an improvement from 0.05 s to 0.004 s.

I have changed a couple of methods and attributes on InputSplitter to do this. The automatic API docs do include the module, but I don't really consider it public API, and that code is already convoluted enough that I don't want to clutter it up with compatibility stuff and make it even harder to understand. I checked that ipykernel and rlipython are not using the pieces I changed.

@takluyver
Copy link
Member Author

Up for discussion: should this be backported to 5.x? It's a worthwhile fix if people use very long inputs, but there is some risk that unknown third-party packages are broken by the API changes.

@rpep
Copy link

rpep commented Nov 14, 2017

I should not that I don't habitually write 500 line cells; I only noticed this because I'm doing some autogeneration work!

@Carreau
Copy link
Member

Carreau commented Nov 14, 2017

can you make indent_space a property (read_only) to change less the API if you are concern ?

Copy link
Member

@willingc willingc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @takluyver. A suggestion on a comment (to confirm my understanding) and a small question.

# indentation level, in order to provide auto-indent facilities.
indent_spaces = 0
# A cache for calculating the current indentation.
# If the first value matches self.source, the second value is an integer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps:

# A cache for storing the current indentation
# The first value stores the most recently processed source input
# The second value is the number of spaces for the current indentation 
# If self.source matches the first value, the second value is a valid current indentation.
# If self.source does not match the first value (the cached source), the indentation
# must be recalculated.

if not self.within_python_line:
line = self.assemble_logical_lines.push(line)
if line is None:
return _accumulating('acc logical line')

return _accumulating('acc logical line')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takluyver Is 'acc logical line' just representing that the line has been assembled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It indicates that the thing that swallowed this line without returning something is the logical line accumulator. These values are only used for debugging, which thankfully we haven't had to do very often.

@takluyver
Copy link
Member Author

Thanks @Carreau @willingc - I have updated this following your suggestions.

Copy link
Member

@willingc willingc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @takluyver

@Carreau
Copy link
Member

Carreau commented Nov 15, 2017

Let's try that ! Thanks @takluyver !

@Carreau Carreau merged commit 8c7ab85 into ipython:master Nov 15, 2017
@takluyver
Copy link
Member Author

What do we think about backporting this to 5.x? I think I lean slightly towards doing so, but I don't have a strong preference.

@Carreau Carreau added this to the 5.6 milestone Nov 15, 2017
@Carreau
Copy link
Member

Carreau commented Nov 15, 2017

Let's see how hard it is.

@meeseeksdev backport to 5.x

@lumberbot-app
Copy link
Contributor

lumberbot-app bot commented Nov 15, 2017

There seem to be a conflict, please backport manually

@lumberbot-app lumberbot-app bot added the Still Needs Manual Backport Added My MrMeeseeks when a backport fails. Help by backporting it, solving conflicts, send PR. label Nov 15, 2017
@Carreau
Copy link
Member

Carreau commented Nov 15, 2017

The conflict seem a bit long:

diff --cc IPython/core/inputsplitter.py
index ac14747d6,337b27d7a..000000000
--- a/IPython/core/inputsplitter.py
+++ b/IPython/core/inputsplitter.py
@@@ -307,7 -414,6 +306,10 @@@ class InputSplitter(object)
          if source.endswith('\\\n'):
              return False
  
++<<<<<<< HEAD
 +        self._update_indent(lines)
++=======
++>>>>>>> 8c7ab85c0... Merge pull request #10898 from takluyver/inputsplitter-indents
          try:
              with warnings.catch_warnings():
                  warnings.simplefilter('error', SyntaxWarning)
@@@ -383,55 -489,20 +385,72 @@@
          # General fallback - accept more code
          return True
  
++<<<<<<< HEAD
 +    #------------------------------------------------------------------------
 +    # Private interface
 +    #------------------------------------------------------------------------
 +
 +    def _find_indent(self, line):
 +        """Compute the new indentation level for a single line.
 +
 +        Parameters
 +        ----------
 +        line : str
 +          A single new line of non-whitespace, non-comment Python input.
 +
 +        Returns
 +        -------
 +        indent_spaces : int
 +          New value for the indent level (it may be equal to self.indent_spaces
 +        if indentation doesn't change.
 +
 +        full_dedent : boolean
 +          Whether the new line causes a full flush-left dedent.
 +        """
 +        indent_spaces = self.indent_spaces
 +        full_dedent = self._full_dedent
 +
 +        inisp = num_ini_spaces(line)
 +        if inisp < indent_spaces:
 +            indent_spaces = inisp
 +            if indent_spaces <= 0:
 +                #print 'Full dedent in text',self.source # dbg
 +                full_dedent = True
 +
 +        if line.rstrip()[-1] == ':':
 +            indent_spaces += 4
 +        elif dedent_re.match(line):
 +            indent_spaces -= 4
 +            if indent_spaces <= 0:
 +                full_dedent = True
 +
 +        # Safety
 +        if indent_spaces < 0:
 +            indent_spaces = 0
 +            #print 'safety' # dbg
 +
 +        return indent_spaces, full_dedent
 +
 +    def _update_indent(self, lines):
 +        for line in remove_comments(lines).splitlines():
 +            if line and not line.isspace():
 +                self.indent_spaces, self._full_dedent = self._find_indent(line)
++=======
+     def get_indent_spaces(self):
+         sourcefor, n = self._indent_spaces_cache
+         if sourcefor == self.source:
+             return n
+ 
+         # self.source always has a trailing newline
+         n = find_next_indent(self.source[:-1])
+         self._indent_spaces_cache = (self.source, n)
+         return n
+ 
+     # Backwards compatibility. I think all code that used .indent_spaces was
+     # inside IPython, but we can leave this here until IPython 7 in case any
+     # other modules are using it. -TK, November 2017
+     indent_spaces = property(get_indent_spaces)
++>>>>>>> 8c7ab85c0... Merge pull request #10898 from takluyver/inputsplitter-indents
  
      def _store(self, lines, buffer=None, store='source'):
          """Store one or more lines of input.

If you're motivated to backport feel free to do so. I'm willing to have the "Needs manual backport" handle by people who actually care. That is to say, we, as a core team do not spend time backporting manually. That's consistent with our timeline.

@takluyver takluyver modified the milestones: 5.6, 6.3 Nov 15, 2017
@takluyver
Copy link
Member Author

OK, I've marked this as 6.3. It was like that for years already, so I doubt many people are hitting it.

@takluyver takluyver deleted the inputsplitter-indents branch November 15, 2017 15:38
@takluyver
Copy link
Member Author

It might be a neat feature for meeseeks if it could display the conflict - or possibly a summary of it if it's big.

@Carreau
Copy link
Member

Carreau commented Nov 15, 2017

Yeah, I though about that, the secret command is git diff --diff-filter=U after the fail cherry-pick. I might do that at some point.

@codypiersall
Copy link

@takluyver I ran into this issue today, and this fixed it. Thank you! Now it doesn't take forever to load the UI for a project I am working on!

@takluyver takluyver removed the Still Needs Manual Backport Added My MrMeeseeks when a backport fails. Help by backporting it, solving conflicts, send PR. label Apr 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants