# Summary

Toying around with a custom pdb class for language model-assisted debugging.

TODO

- [x] test prompt in playground (maybe exclude the "full source" kwarg?)
- [x] port prompt to yaml file
- [x] enable load_prompt/kwargs etc in LMdb init
- [x] consider how we filter locals and globals (currently filter out everything w/ a leading underscore and also do some rather clumsy filtering to make sure global is used in script. But might be able to do better here.)
- [x] consider whether to rm some fields (header, globals, code_full) from get_prompt_kwargs method OR include them in prompt
- consider if there's a good way to make this more conversational in case we need to ask multiple questions. If we just print gpt's response, this won't work so well. Could try to revise this to fit into ConvManager paradigm.
- consider tweaking prompt to use proxy/authority (e.g. "Answer Key")
- consider adding option for "I don't know"
    - Or maybe something like "If you don't know what's causing the bug, say "I don't know". Then write a list of 5 plausible causes that the developer can check for when debugging." (take advantage of its strength at generating list, thinking of possibilities we might not)
- consider how to handle huge data structures (big df, long list, etc.)
~ - See if we can get this to work like ipdb where you can call it only AFTER an error occurs.
- hide user warning about using codex model name.
- debug slowness when using magic (is it calling query multiple times?)
~ - add option to add new cell w/ gpt-fixed function below (may need to adjust prompt a bit to encourage it to provide this)

UPDATE: Something weird going on here. Openai response sometimes looks normal, sometimes very weird (like function was called many times repeatedly - maybe some multiproc/multithreaded thing happening?). When I tried hardcoding other backends (search "partial" or see DebugMagic.lmdb method), the reply appears to be empty. However, the global var `_roboduck_last_completion` gets updated with the expected response. Might be related to the sys.displayhook usage in the self.shell.debugger call (uncomment the source.getlines calls in the DebugMagic.lmdb method).

UPDATE 2: sometimes just need to restart kernel. mock/repeat backends now work as expected.

- maybe update prompt(s) to indicate that we are inside a debugger? Otherwise it might be confusing - if all locals are params, it might seem like we're just telling gpt3 the args.
    - should we be passing in 1 code snippet but a whole sequence of states? That might be better.
- Think more about whether main use case is error explanation (in which case customized stack trace like pretty_errors might make more sense), natural language debugging (in which case we want to focus more on the conversational/sequential nature, maintain series of states, etc.), or static analysis (in which case a jupyter extension or magic that lets us type questions might be ideal).
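
One way the "whole sequence of states" idea could look is sketched below. This is only an illustration: `StateHistory`, `record`, and `render` are hypothetical names that do not exist in this notebook, and the output format is an assumption rather than something the prompts currently expect.

```python
from collections import deque


class StateHistory:
    """Hypothetical helper (not part of RoboDuckDB): keeps the last few
    debugger states so a prompt could show how variables changed over time.
    """

    def __init__(self, maxlen=5):
        # deque drops the oldest state automatically once maxlen is reached.
        self.states = deque(maxlen=maxlen)

    def record(self, lineno, local_vars):
        # Store reprs rather than the objects themselves, since the live
        # objects may be mutated by later steps.
        snapshot = {k: repr(v) for k, v in local_vars.items()}
        self.states.append((lineno, snapshot))

    def render(self):
        lines = []
        for step, (lineno, vars_) in enumerate(self.states):
            pairs = ', '.join(f'{k}={v}' for k, v in vars_.items())
            lines.append(f'STATE {step} (line {lineno}): {pairs}')
        return '\n'.join(lines)


history = StateHistory(maxlen=2)
history.record(3, {'i': 0})
history.record(4, {'i': 0, 'j': 1})
history.record(5, {'i': 1})
print(history.render())
```

With `maxlen=2`, only the two most recent states survive, which keeps the prompt size bounded.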

NOTES

Considerations on how to enter qa mode:

Option 1. Launch some sort of repl here, then let the user type
natural language questions until they want to exit. This would be
nice but maybe a bit tricky - seems like pdb may use prompt_toolkit
already, because calling prompt here throws an error indicating we're
already in an event loop.

Option 2. Prefix every question with "chat" or some command like "Q:".
Have to check if that's possible.

Option 3. Try to override default action selection so that if we
type something that looks like natural language rather than a couple
variable names (maybe something ending in or containing a question
mark) we query gpt instead of trying to eval vars.
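
A rough sketch of the kind of check Option 3 describes is shown below. The function name, the word list, and the minimum word count are all assumptions made for illustration; the class later in this notebook uses a simpler rule (any line containing a question mark).

```python
# Hypothetical heuristic, not the rule RoboDuckDB actually uses.
QUESTION_WORDS = {'why', 'what', 'how', 'when', 'which', 'who'}


def looks_like_question(line, min_words=3):
    """Guess whether a debugger input line is a natural language question
    rather than a pdb command or an expression to evaluate.
    """
    words = line.strip().split()
    # Short inputs like "n", "i,j", or "nums?" are treated as commands.
    if len(words) < min_words:
        return False
    return words[-1].endswith('?') or words[0].lower() in QUESTION_WORDS


for line in ('Why will this throw an index error?', 'i,j', 'n', 'nums?',
             'nums[j] > nums[j + 1]', 'what is nums'):
    print(repr(line), looks_like_question(line))
```

Requiring a few words avoids hijacking IPython-style `obj?` introspection, at the cost of missing very short questions.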

In [1]:
import cmd
from collections.abc import Iterable, Mapping
from collections import deque
from colorama import Fore, Style
from contextlib import redirect_stdout
from functools import partial
import hashlib
import inspect
from IPython.display import display, Javascript
from IPython.core.magics import NamespaceMagics
from IPython.core.magic import cell_magic, line_cell_magic, line_magic, \
    magics_class, Magics, no_var_expand
from IPython.core.magic_arguments import argument, magic_arguments, \
    parse_argstring
import ipynbname
import json
import numpy as np
import pandas as pd
from pdb import Pdb
from prompt_toolkit import prompt
import pyperclip
import re
import sys
import time
import torch
from transformers import AutoTokenizer
import warnings

from htools import *
from jabberwocky.openai_utils import GPT, load_prompt, PromptManager, \
    GPTBackend, EngineMap

Object loaded from /Users/hmamin/jabberwocky/data/misc/gooseai_sample_responses.pkl.


In [2]:
def save_notebook(file_path):
    """Adapted from
    https://stackoverflow.com/questions/32237275/save-an-ipython-notebook-programmatically-from-within-itself/57814673#57814673
    """
    def file_md5(path):
        with open(path, 'rb') as f:
            text = f.read()
        return hashlib.md5(text).hexdigest()
    
    start_md5 = file_md5(file_path)
    display(Javascript('IPython.notebook.save_checkpoint();'))
    current_md5 = start_md5
    
    while start_md5 == current_md5:
        time.sleep(1)
        current_md5 = file_md5(file_path)

In [3]:
# Adapted from cli.ReadmeUpdater method.
def load_ipynb(path, save_if_self=True):
    """Loads ipynb and formats cells into 1 big string.

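    Parameters
    ----------
    path: Path
    save_if_self: bool
        If True and `path` refers to the currently running notebook, save the
        notebook first so that unsaved changes are included.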

    Returns
    -------
    str
    """
    if save_if_self:
        try:
            self_path = ipynbname.path()
        except FileNotFoundError:
            pass
        else:
            if self_path == path:
                save_notebook(path)

    with open(path, 'r') as f:
        cells = json.load(f)['cells']
        
    cell_str = ''
    for cell in cells:
        if not cell['source']: continue
        source = '\n' + ''.join(cell['source']) + '\n'
        if cell['cell_type'] == 'code':
            source = '\n```' + source + '```\n'
        cell_str += source
    return cell_str

In [4]:
def colored(text, color):
    color = getattr(Fore, color.upper())
    return f'{color}{text}{Style.RESET_ALL}'

In [5]:
def truncated_repr(obj, max_len=79):
    """Return an object's repr, truncated to ensure that it doesn't take up
    more characters than we want. This is used to reduce our chances of using
    up all our available tokens in a gpt prompt simply communicating that a
    giant data structure exists, e.g. list(range(1_000_000)). Our use
    case doesn't call for anything super precise so the max_len should be
    thought of as more of a guide than an exact max. I think it's enforced but
    I didn't put a whole lot of thought or effort into confirming that.
    
    Parameters
    ----------
    obj: any
    max_len: int
    
    Returns
    -------
    str: Repr for obj, truncated to approximately max_len characters or fewer.
    When possible, we insert ellipses into the repr to show that truncation
    occurred. Technically there are some edge cases we don't handle (e.g. if
    obj is a class with an insanely long name) but that's not a big deal, at
    least at the moment. I can always revisit that later if necessary.
    """
    def qualname(obj):
        """Similar to type(obj).__qualname__ but that attribute doesn't always
        include the module(s). e.g. pandas Index has __qualname__ "Index" but
        this function returns "<pandas.core.indexes.base.Index>".
        """
        text = str(type(obj))
        # \w also matches digits, so names like numpy.int64 are handled.
        names = re.search(r"<class '([\w.]*)'>", text).groups()
        assert len(names) == 1, f'Should have found only 1 qualname but '\
            f'found: {names}'
        return f'<{names[0]}>'
      
    open2close = {
        '[': ']',
        '(': ')',
        '{': '}'
    }
    repr_ = repr(obj)
    if len(repr_) < max_len:
        return repr_
    if isinstance(obj, pd.DataFrame):
        cols = truncated_repr(obj.columns.tolist(), max_len - 26)
        return f'pd.DataFrame(columns=' \
            f'{truncated_repr(cols, max_len - 22)})'
    if isinstance(obj, pd.Series):
        return f'pd.Series({truncated_repr(obj.tolist(), max_len - 11)})'
    if isinstance(obj, dict):
        length = 5
        res = ''
        for k, v in obj.items():
            if length >= max_len - 2:
                break
            new_str = f'{k!r}: {v!r}, '
            length += len(new_str)
            res += new_str
        return "{" + res.rstrip() + "...}"
    if isinstance(obj, str):
        return repr_[:max_len - 4] + "...'"
    if isinstance(obj, Iterable):
        # A bit risky but sort of elegant. Just recursively take smaller
        # slices until we get an acceptable length. We may end up going
        # slightly over the max length after adding our ellipses but it's
        # not that big a deal, this isn't meant to be super precise. We
        # can also end up with fewer items than we could have fit - if we
        # exhaustively check every possible length one by one until we 
        # find the max length that fits, we can get a very slow function
        # when inputs are long.
        # Can't easily pass smaller max_len value into recursive call 
        # because we always want to compare to the user-specified value.
        n = int(max_len / len(repr_) * len(obj))
        if n == len(obj):
            # Even slicing to just first item is too long, so just revert
            # to treating this like a non-iterable object.
            return qualname(obj)
        # Need to slice set while keeping the original dtype.
        if isinstance(obj, set):
            slice_ = set(list(obj)[:n])
        else:
            slice_ = obj[:n]
        repr_ = truncated_repr(slice_, max_len)
        non_brace_idx = len(repr_) - 1 
        while repr_[non_brace_idx] in open2close.values():
            non_brace_idx -= 1
        if non_brace_idx <= 0 or (non_brace_idx == 3
                                  and repr_.startswith('set')):
            return repr_[:-1] + '...' + repr_[-1]
        return repr_[:non_brace_idx+1] + ',...' + repr_[non_brace_idx+1:]
    
    # We know it's non-iterable at this point.
    if isinstance(obj, type):
        return f'<class {obj.__name__}>'
    if isinstance(obj, (int, float)):
        return truncated_repr(format(obj, '.3e'), max_len)
    return qualname(obj)

In [6]:
def type_annotated_dict_str(dict_, func=repr):
    """String representation of a dict, where each line includes an inline
    comment showing the type of the value.
    """
    type_strs = [f'\n    {func(k)}: {func(v)},   # type: {type(v).__name__}'
                 for k, v in dict_.items()]
    return '{' + ''.join(type_strs) + '\n}'
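
For reference, here is a small usage example showing the string this produces. The function is repeated so the block runs without the rest of the notebook; the sample dict is made up.

```python
def type_annotated_dict_str(dict_, func=repr):
    """Same definition as in the cell above, repeated so this example is
    self-contained.
    """
    type_strs = [f'\n    {func(k)}: {func(v)},   # type: {type(v).__name__}'
                 for k, v in dict_.items()]
    return '{' + ''.join(type_strs) + '\n}'


# Made-up sample values, purely for illustration.
annotated = type_annotated_dict_str({'x': 1, 'name': 'duck'})
print(annotated)
# {
#     'x': 1,   # type: int
#     'name': 'duck',   # type: str
# }
```

In RoboDuckDB, `func` is a `truncated_repr` partial rather than plain `repr`, so large values are shortened before they reach the prompt.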

In [7]:
class CodeCompletionCache:
    
    last_completion = ''

In [34]:
ROBODUCK_GPT = GPTBackend(log_stdout=False)
PROMPT_MANAGER = PromptManager(['debug', 'debug_full'],
                               verbose=False, 
                               gpt=ROBODUCK_GPT)

In [53]:
# class RoboDuckDB(Pdb):
    
#     def __init__(self, *args, backend='openai', model=None, 
#                  full_context=False, log=False, max_len_per_var=79,
#                  **kwargs):
#         """
#         max_len_per_var: int
#             Limits number of characters per variable when communicating 
#             current state (local or global depending on `full_context`) to 
#             gpt. If unbounded, that section of the prompt alone could grow
#             very big . I somewhat arbitrarily set 79 as the default, i.e. 
#             1 line of python per variable. I figure that's usually enough to
#             communicate the gist of what's happening.
#         """
#         super().__init__(*args, **kwargs)
#         self.prompt = '>>> '
#         self.duck_prompt = '[RoboDuck] '
#         self.gpt = GPTBackend(log_stdout=False)
#         # TODO: this does seem to remove the handler from handlers but there
#         # must be some other trace of it because we still log to stdout.
#         self.gpt.handlers = [handler for handler in self.gpt.logger.handlers 
#                              if 'stdout' not in str(handler)]
#         self.query_kwargs = load_prompt(
#             'debug_full' if full_context else 'debug', 
#             verbose=False
#         )
#         self.prompt_template = self.query_kwargs.pop('prompt')
#         if model is not None:
#             self.query_kwargs['model'] = model
#         self.backend = backend
#         self.full_context = full_context
#         self.log = log
#         self._last_completion = ''
#         self.repr_func = partial(truncated_repr, max_len=max_len_per_var)
    
#     def _get_prompt_kwargs(self):
#         res = {}
        
#         # Get current code snippet.
#         try:
#             res['code'] = inspect.getsource(self.curframe)
#         except OSError as err:
#             self.error(err)
#         res['local_vars'] = type_annotated_dict_str(
#             {k: v for k, v in self.curframe_locals.items() 
#              if not is_ipy_name(k)},
#             self.repr_func
#         )
            
#         # Get full source code if necessary.
#         if self.full_context:            
#             # File is a string, either a file name or something like
#             # <ipython-input-50-e97ed612f523>.
#             file = inspect.getsourcefile(self.curframe.f_code)
#             if file.startswith('<ipython'):
#                 res['full_code'] = load_ipynb(ipynbname.path())
#                 res['file_type'] = 'jupyter notebook'
#             else:
#                 res['full_code'] = load(file, verbose=False)
#                 res['file_type'] = 'python script'
#             used_tokens = set(res['full_code'].split())
#         else:   
#             # This is intentionally different from the used_tokens line in the
#             # if clause - we only want to consider local code here.
#             used_tokens = set(res['code'].split())
            
#         # TODO: code.split() might not work so well in some cases.
#         # Namespace is often polluted with lots of unused globals (htools is
#         # very much guilty of this 😬) and we don't want to clutter up the 
#         # prompt with these.
#         res['global_vars'] = type_annotated_dict_str(
#             {k: v for k, v in self.curframe.f_globals.items() 
#              if k in used_tokens and not is_ipy_name(k)},
#             self.repr_func
#         )
#         return res

#     def onecmd(self, line):
#         """Interpret the argument as though it had been typed in response
#         to the prompt.
#         Checks whether this line is typed at the normal prompt or in
#         a breakpoint command list definition.
#         """
#         if not self.commands_defining:
#             if '?' in line:
#                 return self.ask_language_model(line)
#             return cmd.Cmd.onecmd(self, line)
#         else:
#             return self.handle_command_def(line)
        
#     def ask_language_model(self, question):
#         # TODO: maybe should reconstruct each time q is asked? State changes,
#         # that's the whole point of this debugger.
#         prompt_kwargs = self._get_prompt_kwargs()
#         prompt = self.prompt_template.format(question=question, 
#                                              **prompt_kwargs)
#         if len(prompt.split()) > 1_000:
#             warnings.warn(
#                 'Prompt is very long (>1k words). You\'re approaching a risky'
#                 ' zone where your prompt + completion might exceed the max '
#                 'sequence length.'
#             )
#         # TODO rm
#         print(colored(prompt, 'red'))
        
#         # TODO: could we somehow use convmanager here? Given that I envisioned
#         # this as a conversation with the kernel/interpreter/script/something.
#         # TODO: maybe add option in gpt.query to avoid printing to stdout. For
#         # now, just use redirect_stdout here to see what result will look 
#         # like.
#         # TODO: temporarily disabled logging.
#         print(colored(self.duck_prompt, 'green'), end='')
#         res = ''
#         # Suppress jabberwocky auto-warning about codex model name.
#         with warnings.catch_warnings():
#             warnings.filterwarnings('ignore')
#             with self.gpt(self.backend, verbose=False):
#                 for i, (cur, full) in enumerate(self.gpt.query(
#                     prompt, 
#                     **self.query_kwargs, 
#                     log=self.log,
#                     stream=True
#                 )):
#                     if not i and cur.startswith('\n'):
#                         continue
#                     res += cur
#                     for char in cur:
#                         print(colored(char, 'green'), end='')
#                         time.sleep(.02)
#         # Strip trailing quotes because the entire prompt is inside a 
#         # docstring and codex may try to close it. We can't use it as a stop
#         # phrase in case codex generates a fixed code snippet that includes
#         # a docstring.
#         answer = res.strip()
#         if not answer:
#             answer = 'Sorry, I don\'t know. Can you try '\
#                 'rephrasing your question?'
#             print(colored(answer, 'green'))
# #         print(colored(f'{self.duck_prompt} {answer}', 'green'))
        
#         # TODO: when called from magic, ipython seems to delete reference to 
#         # this obj so for now store it as a global var so we can try inserting
#         # a new cell.
#         self._last_completion = answer
#         global _roboduck_last_completion
#         _roboduck_last_completion = answer

class RoboDuckDB(Pdb):
    
    def __init__(self, *args, backend='openai', model=None, 
                 full_context=False, log=False, max_len_per_var=79,
                 **kwargs):
        """
        Once you're in a debugging session, any conversational turn containing
        a question mark will be interpreted as a question for gpt. Prefixing
        your question with "[dev]" will print out the full prompt before
        making the query.
        
        max_len_per_var: int
            Limits number of characters per variable when communicating 
            current state (local or global depending on `full_context`) to 
            gpt. If unbounded, that section of the prompt alone could grow
            very big. I somewhat arbitrarily set 79 as the default, i.e. 
            1 line of python per variable. I figure that's usually enough to
            communicate the gist of what's happening.
        """
        super().__init__(*args, **kwargs)
        self.prompt = '>>> '
        self.duck_prompt = '[RoboDuck] '
        # Check if None explicitly because model=0 is different.
        self.query_kwargs = {'model': model} if model is not None else {}
        self.backend = backend
        self.full_context = full_context
        self.task = 'debug' + '_full'*full_context
        self.log = log
        self.repr_func = partial(truncated_repr, max_len=max_len_per_var)
    
    def _get_prompt_kwargs(self):
        res = {}
        
        # Get current code snippet.
        try:
            res['code'] = inspect.getsource(self.curframe)
        except OSError as err:
            self.error(err)
        res['local_vars'] = type_annotated_dict_str(
            {k: v for k, v in self.curframe_locals.items() 
             if not is_ipy_name(k)},
            self.repr_func
        )
            
        # Get full source code if necessary.
        if self.full_context:            
            # File is a string, either a file name or something like
            # <ipython-input-50-e97ed612f523>.
            file = inspect.getsourcefile(self.curframe.f_code)
            if file.startswith('<ipython'):
                res['full_code'] = load_ipynb(ipynbname.path())
                res['file_type'] = 'jupyter notebook'
            else:
                res['full_code'] = load(file, verbose=False)
                res['file_type'] = 'python script'
            used_tokens = set(res['full_code'].split())
        else:   
            # This is intentionally different from the used_tokens line in the
            # if clause - we only want to consider local code here.
            used_tokens = set(res['code'].split())
            
        # Namespace is often polluted with lots of unused globals (htools is
        # very much guilty of this 😬) and we don't want to clutter up the 
        # prompt with these.
        res['global_vars'] = type_annotated_dict_str(
            {k: v for k, v in self.curframe.f_globals.items() 
             if k in used_tokens and not is_ipy_name(k)},
            self.repr_func
        )
        return res

    def onecmd(self, line):
        """Interpret the argument as though it had been typed in response
        to the prompt.
        Checks whether this line is typed at the normal prompt or in
        a breakpoint command list definition.
        """
        if not self.commands_defining:
            if '?' in line:
                return self.ask_language_model(
                    line, verbose=line.startswith('[dev]')
                )
            return cmd.Cmd.onecmd(self, line)
        else:
            return self.handle_command_def(line)
        
    def ask_language_model(self, question, verbose=False):
        prompt_kwargs = self._get_prompt_kwargs()
        prompt_kwargs['question'] = question
        prompt = PROMPT_MANAGER.prompt(self.task, prompt_kwargs)
        if len(prompt.split()) > 1_000:
            warnings.warn(
                'Prompt is very long (>1k words). You\'re approaching a risky'
                ' zone where your prompt + completion might exceed the max '
                'sequence length.'
            )
        if verbose:
            print(colored(prompt, 'red'))
        
        print(colored(self.duck_prompt, 'green'), end='')
        res = ''
        # Suppress jabberwocky auto-warning about codex model name.
        with warnings.catch_warnings():
            warnings.filterwarnings('ignore')
            with ROBODUCK_GPT(self.backend, verbose=False):
                prev_is_title = False
                for i, (cur, full) in enumerate(PROMPT_MANAGER.query(
                    self.task,
                    prompt_kwargs,
                    **self.query_kwargs, 
                    log=self.log,
                    stream=True
                )):
                    # We do this BEFORE the checks around SOLUTION PART 2
                    # because we don't want to print that line, but we do want
                    # to retain it in our CodeCompletionCache so that our
                    # jupyter magic can easily extract the code portion later.
                    res += cur
                    
                    # Slightly fragile logic - openai currently returns this
                    # in a single streaming step even though the current codex
                    # tokenizer splits it into 5 tokens. If they return this
                    # as multiple tokens, we'd need to change this logic.
                    if cur == 'SOLUTION PART 2':
                        prev_is_title = True
                        continue
                    # Avoid printing the ':' after 'SOLUTION PART 2'. Openai
                    # returns this at a different streaming step.
                    if prev_is_title and cur.startswith(':'):
                        continue
                    prev_is_title = False
                    if not i:
                        cur = cur.lstrip('\n')
                    for char in cur:
                        print(colored(char, 'green'), end='')
                        time.sleep(.02)
        
        # Strip surrounding whitespace. Note that this does NOT strip trailing
        # triple quotes: the entire prompt is inside a docstring and codex may
        # try to close it, but we can't use quotes as a stop phrase in case
        # codex generates a fixed code snippet that includes a docstring.
        answer = res.strip()
        if not answer:
            answer = 'Sorry, I don\'t know. Can you try '\
                'rephrasing your question?'
            print(colored(answer, 'green'))
        
        # When using the `duck` jupyter magic in "insert" mode, we reference
        # the CodeCompletionCache to populate the new code cell.
        CodeCompletionCache.last_completion = answer

In [51]:
@magics_class
class DebugMagic(Magics):

    @magic_arguments()
    @argument('-i', action='store_true', 
              help='Boolean flag: if provided, INSERT a new code cell with '
                   'the suggested code fix.')
    @line_magic
    def duck(self, line='', cell=None):
        """Start a RoboDuckDB post-mortem debugging session on the most recent
        exception. The -i flag inserts a new code cell containing the code fix
        suggested by gpt (if any) once the session ends.
        """
        args = parse_argstring(self.duck, line)
        cls = self.shell.debugger_cls
        self.shell.debugger_cls = RoboDuckDB
        self.shell.InteractiveTB.debugger_cls = RoboDuckDB
        self.shell.debugger(force=True)
        # Insert suggested code into next cell.
        if args.i and CodeCompletionCache.last_completion:
            *_, code_snippet = CodeCompletionCache.last_completion.split(
                'SOLUTION PART 2:'
            )
            self.shell.set_next_input(code_snippet.lstrip('\n'),
                                      replace=False)
        CodeCompletionCache.last_completion = ''
        self.shell.debugger_cls = self.shell.InteractiveTB.debugger_cls = cls


shell = get_ipython()
shell.magics_manager.magics.get('line', {}).pop('duck', None)
shell.register_magics(DebugMagic)
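
The `-i` branch relies on the completion containing a "SOLUTION PART 2:" section. Below is a minimal sketch of that split as a standalone function. `split_completion` is a hypothetical helper that is not defined elsewhere in this notebook, and unlike the magic above it returns an empty code string when the title is missing or the section is "N/A" (the value the prompt asks for when no fix is known).

```python
def split_completion(completion, title='SOLUTION PART 2:'):
    """Hypothetical helper: split a completion into (explanation, code)."""
    explanation, sep, code = completion.rpartition(title)
    if not sep:
        # Title never appeared, so treat everything as explanation.
        return completion.strip(), ''
    code = code.strip('\n')
    if code.strip().upper() == 'N/A':
        code = ''
    return explanation.strip(), code


sample = ('SOLUTION PART 1:\nThe loop runs one step too far.\n\n'
          'SOLUTION PART 2:\ndef f():\n    return 1\n')
explanation, code = split_completion(sample)
print(explanation)
print(code)
```

Returning an empty string for the code portion would let the caller skip `set_next_input` instead of inserting plain English into a code cell.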

In [37]:
def roboduck(backend='openai', model=None):
    # Equivalent of native breakpoint().
    RoboDuckDB(backend=backend, model=model).set_trace(sys._getframe().f_back)

In [38]:
def foo(x):
    for i in range(x):
        roboduck()
        print(2 / (i - 3))

In [46]:
def bubble_sort(nums):
    for i in range(len(nums)):
        for j in range(len(nums)):
            if nums[j] > nums[j + 1]:
                nums[j], nums[j + 1] = nums[j + 1], nums[j]
            roboduck()
    return nums

In [40]:
# def bubble_sort(nums):
#     for i in range(len(nums)):
#         for j in range(len(nums) - 1):
#             if nums[j] > nums[j + 1]:
#                 nums[j + 1] = nums[j]
#                 nums[j] = nums[j + 1]
# #             roboduck()
#     return nums

In [41]:
nums_ = [9, 9, 9]

In [42]:
# def bubble_sort(nums):
#     for i in range(len(nums)):
#         for j in range(len(nums) - 1):
#             if nums[j] > nums[j + 1]:
#                 nums[j], nums[j + 1] = nums[j + 1], nums[j]
# #             roboduck()
#     return nums_

In [43]:
# Set some globals.
z = 100
a = ['a', 'b', 'c']
nums = DotDict({i: i*2 for i in range(100)})

In [44]:
print('This is some output.')

This is some output.


In [47]:
# Un-comment the roboduck() line.
bubble_sort([5, 2, 4, 4, 3, 1, 9, 17, 7])

[27] > <ipython-input-46-95174b1d61b8>(3)bubble_sort()
-> for j in range(len(nums)):
   1 frame hidden (try 'help hidden_frames')
>>> i,j
(0, 0)
>>> n
[27] > <ipython-input-46-95174b1d61b8>(4)bubble_sort()
-> if nums[j] > nums[j + 1]:
   1 frame hidden (try 'help hidden_frames')
>>> n
[27] > <ipython-input-46-95174b1d61b8>(5)bubble_sort()
-> nums[j], nums[j + 1] = nums[j + 1], nums[j]
   1 frame hidden (try 'help hidden_frames')
>>> n
[27] > <ipython-input-46-95174b1d61b8>(6)bubble_sort()
-> roboduck()
   1 frame hidden (try 'help hidden_frames')
>>> n
[27] > <ipython-input-46-95174b1d61b8>(3)bubble_sort()
-> for j in range(len(nums)):
   1 frame hidden (try 'help hidden_frames')
>>> n
[27] > <ipython-input-46-95174b1d61b8>(4)bubble_sort()
-> if nums[j] > nums[j + 1]:
   1 frame hidden (try 'help hidden_frames')
>>> n
[27] >

            if nums[j] > nums[j + 1]:
                nums[j], nums[j + 1] = nums[j + 1], nums[j]

BdbQuit: 

## Magic

In [18]:
# Re-comment the roboduck() line.
bubble_sort([5, 2, 4, 4, 3, 1, 9, 17, 7])

IndexError: list index out of range

In [19]:
# Note: couldn't get cell magic version working so far. Says:
# "UsageError: %%lmdb is a cell magic, but the cell body is empty. Did you
# mean the line magic %lmdb (single %)?"
# Even when I try to define the method with all the same settings as the 
# default class.
%duck -i

[1] > <ipython-input-12-cab5a6393537>(4)bubble_sort()
-> if nums[j] > nums[j + 1]:
>>> Why will this throw an index error?
"""ANSWER KEY

This code snippet is not working as expected. Help me debug it. First read my question, then examine the snippet of code that is causing the issue and look at the values of the local and global variables. If you see a function called roboduck, ignore it. In the section titled SOLUTION PART 1, use plain English to explain what the problem is and how to fix it. In the section titled SOLUTION PART 2, write a corrected version of the input code snippet. If you don't know what the problem is, SOLUTION PART 1 should list a few possible causes or things I could try in order to identify the issue and SOLUTION PART 2 should say N/A. Be concise and use simple language because I am a beginning programmer.

QUESTION:
Why will this throw an index error?

CURRENT CODE SNIPPET:
def bubble_sort(nums):
    for i in range(len(nums)):
   

In [None]:
def bubble_sort(nums):
    for i in range(len(nums)):
        for j in range(len(nums) - 1):
            if nums[j] > nums[j + 1]:
                nums[j], nums[j + 1] = nums[j + 1], nums[j]
#             roboduck()
    return nums

## Scratch

Mostly informal tests of functionality I developed above. Moving down because jupyter magic seems to require restarting the kernel a lot and I don't want to re-run unnecessary things (especially slow things like loading the tokenizer) or cells that need custom editing (like the one that requires you to change the cell without saving before calling load_ipynb). This means we can now restart, select this cell and click Run All Above.

In [5]:
{name: is_ipy_name(name)
 for name in ('_1', '_99', '_', '__', '_1_', '_a', '__1')}

{'_1': True,
 '_99': True,
 '_': True,
 '__': True,
 '_1_': False,
 '_a': False,
 '__1': True}
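
The output above suggests `is_ipy_name` flags names made of underscores optionally followed by digits (IPython's output-history variables). The regex below is only a guess that reproduces these seven results; it is NOT the htools implementation, which may also handle other names.

```python
import re


def is_ipy_name_sketch(name):
    """Hypothetical stand-in for htools' is_ipy_name, inferred purely from
    the sample outputs in this notebook: one or more underscores followed by
    zero or more digits.
    """
    return re.fullmatch(r'_+\d*', name) is not None


sketch_results = {name: is_ipy_name_sketch(name)
                  for name in ('_1', '_99', '_', '__', '_1_', '_a', '__1')}
print(sketch_results)
```

Note that `_a` is not filtered, so ordinary private variables would still reach the prompt under this rule.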

In [6]:
# Set new var on line below and do NOT save notebook.
qqq = 'xcz,vl lzvjxc'
tmp = load_ipynb(ipynbname.path())
assert qqq in tmp

<IPython.core.display.Javascript object>

In [8]:
# Set new var on line below and do NOT save. If you don't change the var, the
# test will generally fail bc a previous version of the nb will have had the
# var value.
qqq = 'a1zce1zazzzzzzzz eoiqur wqopasdfasferu'
tmp = load_ipynb(ipynbname.path(), save_if_self=False)
assert qqq not in tmp

In [9]:
for obj in (
    'string',
    ('tuple', 1),
    ['list', 2],
    {'dict': 3},
    {'set'},
    True,
    MultiLogger(None)
):
    print(type(obj), isinstance(obj, Iterable), isinstance(obj, Mapping))

<class 'str'> True False
<class 'tuple'> True False
<class 'list'> True False
<class 'dict'> True True
<class 'set'> True False
<class 'bool'> False False
<class 'htools.meta.MultiLogger'> False False


In [11]:
df_tmp = pd.DataFrame(np.arange(390).reshape(30, 13),
                      columns=list('abcdefghijklm'))

In [12]:
print(truncated_repr(df_tmp, 79))

pd.DataFrame(columns="['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',...]")


In [13]:
print(truncated_repr(df_tmp.a, 79))

pd.Series([0, 13, 26, 39, 52, 65, 78, 91, 104, 117, 130, 143, 156, 169,...])


In [14]:
print(truncated_repr(pd.Series(np.arange(1000)), 100))

pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,...])


In [15]:
truncated_repr(df_tmp.columns.tolist() * 5, 22)

"['a', 'b', 'c', 'd',...]"

In [63]:
truncated_repr('abcdefghijklmnopqrstuvwxyz', 20)

"'abcdefghijklmno...'"

In [64]:
truncated_repr(list(range(1000)), 50)

'[0, 1, 2, 3, 4, 5, 6, 7, 8, 9,...]'

In [18]:
truncated_repr(dict(enumerate('abcdefghijklmnop')), 50)

"{0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f',...}"

In [19]:
truncated_repr(dict(enumerate('abcdefghijklmnop')), 58)

"{0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g',...}"

In [20]:
truncated_repr(True, 6)

'True'

In [21]:
truncated_repr(MultiLogger(None), 70)

'<htools.meta.MultiLogger object at 0x7fcb8442d828>'

In [22]:
truncated_repr(MultiLogger, 70)

"<class 'htools.meta.MultiLogger'>"

In [23]:
truncated_repr(MultiLogger(None), 14)

'<htools.meta.MultiLogger>'

In [24]:
truncated_repr(MultiLogger, 14)

'<class MultiLogger>'

In [25]:
truncated_repr(colored, 14)

'<function>'

In [26]:
truncated_repr(np.arange(100), 25)

'array([0, 1, 2, 3, 4,...])'

In [27]:
truncated_repr(list(range(100)), 25)

'[0, 1, 2, 3, 4, 5,...]'

In [28]:
truncated_repr(torch.arange(100), 25)

'tensor([0, 1, 2, 3, 4,...])'

In [29]:
truncated_repr(DotDict({i: i*2 for i in range(100)}), 25)

'{0: 0, 1: 2, 2: 4,...}'

In [30]:
truncated_repr(1379823479234**10, 25)

"'2.502e+121'"

In [32]:
d = {
    'a': [1, 2, 3],
    True: 223_456,
    (3, 4): {'a', 'z', 'a'},
    0: {1: 3, 5: 'x'},
    'dd': DotDict({3: '4', 'a': True}),
    'ddcls': DotDict,
}
print(type_annotated_dict_str(d))

{
    'a': [1, 2, 3],   # type: list
    True: 223456,   # type: int
    (3, 4): {'a', 'z'},   # type: set
    0: {1: 3, 5: 'x'},   # type: dict
    'dd': {3: '4', 'a': True},   # type: DotDict
    'ddcls': <class 'htools.structures.DotDict'>,   # type: type
}


In [33]:
d = {
    'a': list(range(100)),
    True: 223_456**40,
    (3, 4): {'a'*i for i in range(100)},
    0: {1: 3, 5: 'x'},
    'dd': DotDict({i: 1/(i+1) for i in range(200)}),
    'ddcls': DotDict,
}
print(type_annotated_dict_str(d, partial(truncated_repr, max_len=50)))

{
    'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,...],   # type: list
    True: '9.283e+213',   # type: int
    (3, 4): set(...),   # type: set
    0: {1: 3, 5: 'x'},   # type: dict
    'dd': {0: 1.0, 1: 0.5, 2: 0.3333333333333333, 3: 0.25,...},   # type: DotDict
    'ddcls': <class 'htools.structures.DotDict'>,   # type: type
}


In [34]:
d = {
    'a': list(range(100)),
    True: 223_456**40,
    (3, 4): {'a'*i for i in range(100)},
    0: {1: 3, 5: 'x'},
    'dd': DotDict({i: 1/(i+1) for i in range(200)}),
    'ddcls': DotDict,
}
print(type_annotated_dict_str(d, partial(truncated_repr, max_len=79)))

{
    'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,...],   # type: list
    True: '9.283e+213',   # type: int
    (3, 4): {'',...},   # type: set
    0: {1: 3, 5: 'x'},   # type: dict
    'dd': {0: 1.0, 1: 0.5, 2: 0.3333333333333333, 3: 0.25, 4: 0.2, 5: 0.16666666666666666,...},   # type: DotDict
    'ddcls': <class 'htools.structures.DotDict'>,   # type: type
}


In [35]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')

In [36]:
len(tokenizer.tokenize(
    type_annotated_dict_str(d, partial(truncated_repr, max_len=79))
))

210

In [37]:
truncated_repr(
    {'a'*i for i in range(100)},
    53
)

'set(...)'

In [38]:
truncated_repr(
    {'a'*i for i in range(100)},
    54
)

"{'',...}"

In [39]:
truncated_repr(
    ['a'*i for i in range(100)],
    53
)

'[...]'

In [40]:
truncated_repr(
    ['a'*i for i in range(100)],
    54
)

"['',...]"