In [1]:
#default_exp mdx

# Preprocessors For MDX

> Custom preprocessors that help convert notebook content into MDX

This module defines [nbconvert custom preprocessors](https://nbconvert.readthedocs.io/en/latest/nbconvert_library.html#Custom-Preprocessors) that facilitate transforming notebook content into MDX, a variant of Markdown.

## Cell Tag Cheatsheet

You can enable or disable these preprocessors with special comments placed in a code cell.  The sections below list every special comment.

All comments start with `#meta` or `#cell_meta`, which are both aliases for the same thing.  For brevity, we will use `#meta` in this cheatsheet.

### Black code formatting

`#meta:tag=black` will apply black code formatting.

### Show/Hide Cells

1. Remove entire cells:  `#meta:tag=remove_cell` or `#meta:tag=hide`
2. Remove output: `#meta:tag=remove_output` or `#meta:tag=remove_outputs` or `#meta:tag=hide_outputs` or `#meta:tag=hide_output`
3. Remove input: same as above, except `input` instead of `output`.

### Hiding specific lines of output

1. Remove lines of output containing keywords: `#meta:filter_words=FutureWarning,MultiIndex`
2. Limit the number of lines of output: `#meta:limit=6` shows only the first 6 lines


### Hiding specific lines of input (code)

Use the comment `#meta_hide_line` to hide a specific line of code:

```python
def show():
    a = 2
    b = 3 #meta_hide_line
```

### Selecting Metaflow Steps

You can selectively show Metaflow steps in the output logs:

1. Show one step: `#meta:show_steps=<step_name>`
2. Show multiple steps: `#meta:show_steps=<step1_name>,<step2_name>`

In [2]:
# export
from nbconvert.preprocessors import Preprocessor
from nbconvert import MarkdownExporter
from nbconvert.preprocessors import TagRemovePreprocessor
from nbdev.imports import get_config
from traitlets.config import Config
from pathlib import Path
import re, uuid
from fastcore.basics import AttrDict
from nbdoc.media import ImagePath, ImageSave, HTMLEscape
from black import format_str, Mode

In [3]:
#hide
from nbdev.export import read_nb
from nbconvert import NotebookExporter
from nbdoc.test_utils import run_preprocessor, show_plain_md
from nbdoc.run import _gen_nb
import json

__file__ = str(get_config().path("lib_path")/'preproc.py')

In [4]:
#export
_re_meta= r'^\s*#(?:cell_meta|meta):\S+\s*[\n\r]'

## Injecting Metadata Into Cells -

In [5]:
#export
class InjectMeta(Preprocessor):
    """
    Allows you to inject metadata into a cell for further preprocessing with a comment.
    """
    pattern = r'(^\s*#(?:cell_meta|meta):)(\S+)(\s*[\n\r])'
    
    def preprocess_cell(self, cell, resources, index):
        if cell.cell_type == 'code' and re.search(_re_meta, cell.source, flags=re.MULTILINE):
            cell_meta = re.findall(self.pattern, cell.source, re.MULTILINE)
            d = cell.metadata.get('nbdoc', {})
            for _, m, _ in cell_meta:
                if '=' in m:
                    k,v = m.split('=', 1) # split on the first '=' only, so values may contain '='
                    d[k] = v
                else: print(f"Warning cell_meta:{m} does not have '=' will be ignored.")
            cell.metadata['nbdoc'] = d
        return cell, resources

To inject metadata, add a comment to a cell with the following pattern: `#cell_meta:<key>=<value>`. Note that `#meta` is an alias for `#cell_meta`.

For example, consider the following code:

In [6]:

_test_file = 'test_files/hello_world.ipynb'
first_cell = read_nb(_test_file)['cells'][0]
print(first_cell['source'])

#meta:show_steps=start,train
print('hello world')


At the moment, this cell has no metadata:

In [7]:
print(first_cell['metadata'])

{}


However, after we process this notebook with `InjectMeta`, the appropriate metadata will be injected:

In [8]:
c = Config()
c.NotebookExporter.preprocessors = [InjectMeta]
exp = NotebookExporter(config=c)
cells, _ = exp.from_filename(_test_file)
first_cell = json.loads(cells)['cells'][0]

assert first_cell['metadata'] == {'nbdoc': {'show_steps': 'start,train'}}
first_cell['metadata']

{'nbdoc': {'show_steps': 'start,train'}}
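
The comment parsing can also be tried outside of nbconvert. The following is a minimal sketch that reuses the same regular expression as `InjectMeta`; the helper name `parse_meta` and the sample source are hypothetical and exist only for illustration:

```python
import re

# Same shape as InjectMeta.pattern: comment prefix, key=value payload, line ending.
META_PATTERN = r'(^\s*#(?:cell_meta|meta):)(\S+)(\s*[\n\r])'

def parse_meta(source):
    """Collect key=value pairs from #meta / #cell_meta comments (hypothetical helper)."""
    found = {}
    for _, payload, _ in re.findall(META_PATTERN, source, flags=re.MULTILINE):
        if '=' in payload:
            # Split on the first '=' so the value keeps any later '=' characters.
            key, value = payload.split('=', 1)
            found[key] = value
    return found

sample = "#meta:show_steps=start,train\n#cell_meta:limit=6\nprint('hello world')\n"
parsed = parse_meta(sample)
print(parsed)
```

Both the `#meta` and `#cell_meta` spellings end up as keys in the same dictionary, which is what later preprocessors read from `cell.metadata['nbdoc']`.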

## Strip Ansi Characters From Output -

In [9]:
#export
_re_ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')

class StripAnsi(Preprocessor):
    """Strip Ansi Characters."""
    
    def preprocess_cell(self, cell, resources, index):
        for o in cell.get('outputs', []):
            if o.get('name') and o.name == 'stdout': 
                o['text'] = _re_ansi_escape.sub('', o.text)
        return cell, resources

`StripAnsi` removes the ANSI color codes that are streamed to standard out, which can interfere with static site generators:

In [10]:
c, _ = run_preprocessor([StripAnsi], 'test_files/run_flow.ipynb')
assert not _re_ansi_escape.findall(c)
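
To see what the regular expression removes, here is a small standalone sketch applied to a hand-written string. The sample text is made up for illustration; only the pattern is taken from the cell above:

```python
import re

# Same expression as _re_ansi_escape above.
ANSI_ESCAPE = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')

# A hand-written sample with color and bold escape sequences.
colored = "\x1b[35m\x1b[1mMetaflow\x1b[0m executing \x1b[31mMyFlow\x1b[0m"
plain = ANSI_ESCAPE.sub('', colored)
print(plain)
```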

In [11]:
# export
def _get_cell_id(id_length=36):
    "generate random id for artificial notebook cell"
    return uuid.uuid4().hex[:id_length]

def _get_md_cell(content="<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! Instead, edit the notebook w/the location & name as this file. -->"):
    "generate markdown cell with content"
    cell = AttrDict({'cell_type': 'markdown',
                     'id': f'{_get_cell_id()}',
                     'metadata': {},
                     'source': f'{content}'})
    return cell

## Insert Warning Into Markdown -

In [12]:
# export
class InsertWarning(Preprocessor):
    """Insert Autogenerated Warning Into Notebook after the first cell."""
    def preprocess(self, nb, resources):
        nb.cells = nb.cells[:1] + [_get_md_cell()] + nb.cells[1:]
        return nb, resources

This preprocessor inserts a warning in the markdown destination that the file is autogenerated.  This warning is inserted in the second cell so we do not interfere with front matter.
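
The insertion itself is plain list slicing. A minimal sketch, using strings in place of notebook cells (the cell contents here are placeholders, not real notebook data):

```python
# Strings stand in for notebook cells; the first one plays the role of front matter.
cells = ["front matter", "intro text", "code cell"]
warning = "<!-- WARNING: THIS FILE WAS AUTOGENERATED! -->"

# Keep the first cell in place, then the warning, then everything else.
with_warning = cells[:1] + [warning] + cells[1:]
print(with_warning)
```

Because slicing never raises on short lists, the same expression also works for a notebook with a single cell or no cells at all.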

In [13]:
c, _ = run_preprocessor([InsertWarning], 'test_files/hello_world.ipynb', display_results=True)
assert "<!-- WARNING: THIS FILE WAS AUTOGENERATED!" in c

```python
#meta:show_steps=start,train
print('hello world')
```

<CodeOutputBlock lang="python">

```
    hello world
```

</CodeOutputBlock>



```python

```



## Remove Empty Code Cells -

In [14]:
# export
def _emptyCodeCell(cell):
    "Return True if cell is an empty Code Cell."
    if cell['cell_type'] != 'code': return False
    # always return a bool, including for non-empty code cells
    return not cell.source or not cell.source.strip()


class RmEmptyCode(Preprocessor):
    """Remove empty code cells."""
    def preprocess(self, nb, resources):
        new_cells = [c for c in nb.cells if not _emptyCodeCell(c)]
        nb.cells = new_cells
        return nb, resources

Notice how this notebook has an empty code cell at the end:

In [15]:
show_plain_md('test_files/hello_world.ipynb')

```python
#meta:show_steps=start,train
print('hello world')
```

    hello world



```python

```



With `RmEmptyCode` these empty code cells are stripped from the markdown:

In [16]:
c, _ = run_preprocessor([RmEmptyCode], 'test_files/hello_world.ipynb', display_results=True)
assert len(re.findall('```python',c)) == 1

```python
#meta:show_steps=start,train
print('hello world')
```

<CodeOutputBlock lang="python">

```
    hello world
```

</CodeOutputBlock>



## Truncate Metaflow Output -

In [17]:
#export
class MetaflowTruncate(Preprocessor):
    """Remove the preamble and timestamp from Metaflow output."""
    _re_pre = re.compile(r'([\s\S]*Metaflow[\s\S]*Validating[\s\S]+The graph[\s\S]+)(\n[\s\S]+Workflow starting[\s\S]+)')
    _re_time = re.compile(r'\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\.\d{3}')
    
    def preprocess_cell(self, cell, resources, index):
        if re.search(r'\s*python.+run.*', cell.source) and 'outputs' in cell:
            for o in cell.outputs:
                if o.name == 'stdout':
                    o['text'] = self._re_time.sub('', self._re_pre.sub(r'\2', o.text)).strip()
        return cell, resources

When you run a Metaflow flow, you are presented with a fair amount of boilerplate before the job starts running, which is not necessary to show in the documentation:

In [18]:
show_plain_md('test_files/run_flow.ipynb')

```python
#meta:show_steps=start
!python myflow.py run
```

    [35m[1mMetaflow 2.5.3[0m[35m[22m executing [0m[31m[1mMyFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:hamel[0m[35m[22m[K[0m[35m[22m[0m
    [35m[22mValidating your flow...[K[0m[35m[22m[0m
    [32m[1m    The graph looks good![K[0m[32m[1m[0m
    [35m[22mRunning pylint...[K[0m[35m[22m[0m
    [32m[1m    Pylint is happy![K[0m[32m[1m[0m
    [35m2022-03-14 17:28:44.983 [0m[1mWorkflow starting (run-id 1647304124981100):[0m
    [35m2022-03-14 17:28:44.990 [0m[32m[1647304124981100/start/1 (pid 41951)] [0m[1mTask is starting.[0m
    [35m2022-03-14 17:28:45.630 [0m[32m[1647304124981100/start/1 (pid 41951)] [0m[22mthis is the start[0m
    [35m2022-03-14 17:28:45.704 [0m[32m[1647304124981100/start/1 (pid 41951)] [0m[1mTask finished successfully.[0m
    [35m2022-03-14 17:28:45.710 [0m[32m[1647304124981100/end/2 (pid 41954)] [0m[1mTask is starting.[0m
    [35m

We don't need to see the beginning part that validates the graph, and we don't need the time-stamps either.  We can remove these with the `MetaflowTruncate` preprocessor:

In [19]:
c, _ = run_preprocessor([MetaflowTruncate], 'test_files/run_flow.ipynb', display_results=True)
assert 'Validating your flow...' not in c

```python
#meta:show_steps=start
!python myflow.py run
```

<CodeOutputBlock lang="python">

```
    [35m [0m[1mWorkflow starting (run-id 1647304124981100):[0m
    [35m [0m[32m[1647304124981100/start/1 (pid 41951)] [0m[1mTask is starting.[0m
    [35m [0m[32m[1647304124981100/start/1 (pid 41951)] [0m[22mthis is the start[0m
    [35m [0m[32m[1647304124981100/start/1 (pid 41951)] [0m[1mTask finished successfully.[0m
    [35m [0m[32m[1647304124981100/end/2 (pid 41954)] [0m[1mTask is starting.[0m
    [35m [0m[32m[1647304124981100/end/2 (pid 41954)] [0m[22mthis is the end[0m
    [35m [0m[32m[1647304124981100/end/2 (pid 41954)] [0m[1mTask finished successfully.[0m
    [35m [0m[1mDone![0m
    [0m
```

</CodeOutputBlock>
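
The timestamp removal can be illustrated in isolation. This sketch covers only the timestamp part of `MetaflowTruncate` (not the preamble removal) and runs on a hand-written log with the color codes left out:

```python
import re

# Matches timestamps such as "2022-03-14 17:28:44.983".
TIMESTAMP = re.compile(r'\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\.\d{3}')

log = ("2022-03-14 17:28:44.983 Workflow starting (run-id 1647304124981100):\n"
       "2022-03-14 17:28:44.990 [1647304124981100/start/1 (pid 41951)] Task is starting.")

# Remove the timestamps, then trim the space each one leaves behind.
stripped = "\n".join(line.strip() for line in TIMESTAMP.sub('', log).splitlines())
print(stripped)
```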



## Turn Metadata into Cell Tags -

In [20]:
#export
class UpdateTags(Preprocessor):
    """
    Create cell tags based upon comment `#cell_meta:tags=<tag>`
    """
    
    def preprocess_cell(self, cell, resources, index):
        root = cell.metadata.get('nbdoc', {})
        tags = root.get('tags', root.get('tag')) # allow the singular also
        if tags: cell.metadata['tags'] = cell.metadata.get('tags', []) + tags.split(',')
        return cell, resources

Consider this Python notebook prior to processing.  The comments can be used to configure the visibility of cells.

- `#cell_meta:tags=remove_output` will just remove the output
- `#cell_meta:tags=remove_input` will just remove the input
- `#cell_meta:tags=remove_cell` will remove both the input and output
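
The way these comments turn into cell tags can be sketched with plain dictionaries. This is an illustration of the same logic as `UpdateTags`, not a replacement for it; `merge_tags` and the sample metadata are hypothetical:

```python
def merge_tags(metadata):
    """Append tags found under metadata['nbdoc'] to metadata['tags'] (hypothetical helper)."""
    nbdoc = metadata.get('nbdoc', {})
    raw = nbdoc.get('tags', nbdoc.get('tag'))  # accept the singular spelling too
    if raw:
        metadata['tags'] = metadata.get('tags', []) + raw.split(',')
    return metadata

# A cell that already has one tag and carries two more in its injected metadata.
meta = {'nbdoc': {'tags': 'remove_input,remove_output'}, 'tags': ['black']}
merged = merge_tags(meta)
print(merged['tags'])
```

Existing tags are kept and the new ones are appended, so tags set through the notebook interface and tags set through comments can coexist.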

Note that you can use `#cell_meta:tag` or `#cell_meta:tags` as they are both aliases for the same thing.  Here is a notebook before preprocessing:

In [21]:
show_plain_md('test_files/visibility.ipynb')

# Configuring Cell Visibility

#### Cell with the comment `#cell_meta:tag=remove_output`


```
#cell_meta:tag=remove_output
print('the output is removed, so you can only see the print statement.')
```

    the output is removed, so you can only see the print statement.


#### Cell with the comment `#cell_meta:tag=remove_input`


```
#cell_meta:tag=remove_input
print('hello, you cannot see the code that created me.')
```

    hello, you cannot see the code that created me.


#### Cell with the comment `#cell_meta:tag=remove_cell`


```
#cell_meta:tag=remove_cell
print('you will not be able to see this cell at all')
```

    you will not be able to see this cell at all



```
#cell_meta:tags=remove_input,remove_output
print('you will not be able to see this cell at all either')
```

    you will not be able to see this cell at all either




`UpdateTags` is meant to be used with `InjectMeta` and `TagRemovePreprocessor` to configure the visibility of cells in rendered docs.  Here you can see what the notebook looks like after pre-processing:

In [22]:
# Configure an exporter from scratch
_test_file = 'test_files/visibility.ipynb'
c = Config()
c.TagRemovePreprocessor.remove_cell_tags = ("remove_cell",)
c.TagRemovePreprocessor.remove_all_outputs_tags = ('remove_output',)
c.TagRemovePreprocessor.remove_input_tags = ('remove_input',)
c.MarkdownExporter.preprocessors = [InjectMeta, UpdateTags, TagRemovePreprocessor]
exp = MarkdownExporter(config=c)
result = exp.from_filename(_test_file)[0]

# show the results
assert 'you will not be able to see this cell at all either' not in result
print(result)

# Configuring Cell Visibility

#### Cell with the comment `#cell_meta:tag=remove_output`


```
#cell_meta:tag=remove_output
print('the output is removed, so you can only see the print statement.')
```

#### Cell with the comment `#cell_meta:tag=remove_input`

    hello, you cannot see the code that created me.


#### Cell with the comment `#cell_meta:tag=remove_cell`



## Selecting Metaflow Steps In Output -

In [23]:
#export
class MetaflowSelectSteps(Preprocessor):
    """
    Hide Metaflow steps in output based on cell metadata.
    """
    re_step = r'.*\d+/{0}/\d+\s\(pid\s\d+\).*'
    
    def preprocess_cell(self, cell, resources, index):
        root = cell.metadata.get('nbdoc', {})
        steps = root.get('show_steps', root.get('show_step'))
        if re.search(r'\s*python.+run.*', cell.source) and 'outputs' in cell and steps:
            for o in cell.outputs:
                if o.name == 'stdout':
                    final_steps = []
                    for s in steps.split(','):
                        found_steps = re.compile(self.re_step.format(s)).findall(o['text'])
                        if found_steps: 
                            final_steps += found_steps + ['...']
                    o['text'] = '\n'.join(['...'] + final_steps)
        return cell, resources

`MetaflowSelectSteps` is meant to be used with `InjectMeta` to only show specific steps in the output logs from Metaflow.  

To show only specific steps in your flow, annotate your cell with a comment of the form `#cell_meta:show_steps=<step_name>`, separating multiple step names with commas. For example, `#cell_meta:show_steps=start,train` shows only the `start` and `train` steps.

Note that `show_step` and `show_steps` are aliases for convenience, so you don't need to worry about the `s` at the end.
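
The step selection can be sketched on a plain string. The function `select_steps` and the sample log are hypothetical, written to mirror the logic of `MetaflowSelectSteps` with the color codes and timestamps omitted:

```python
import re

# Same template as MetaflowSelectSteps.re_step; {0} is replaced by the step name.
STEP_TEMPLATE = r'.*\d+/{0}/\d+\s\(pid\s\d+\).*'

def select_steps(text, steps):
    """Keep only the log lines that belong to the requested steps (hypothetical helper)."""
    kept = []
    for step in steps.split(','):
        lines = re.findall(STEP_TEMPLATE.format(step), text)
        if lines:
            kept += lines + ['...']  # an ellipsis marks the gap after each step
    return '\n'.join(['...'] + kept)

log = "\n".join([
    "Workflow starting (run-id 1644962474801237):",
    "[1644962474801237/start/1 (pid 46758)] Task is starting.",
    "[1644962474801237/train/2 (pid 46763)] the train step",
    "[1644962474801237/end/3 (pid 46770)] Task is starting.",
])
selected = select_steps(log, "start,train")
print(selected)
```

Note that the step name is inserted into the regular expression as-is, so this sketch assumes step names contain no regex metacharacters.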

In the below example, `#cell_meta:show_steps=start,train` shows the `start` and `train` steps, whereas `#cell_meta:show_steps=train` only shows the `train` step:

In [24]:
c, _ = run_preprocessor([InjectMeta, MetaflowSelectSteps], 
                        'test_files/run_flow_showstep.ipynb', 
                        display_results=True)
assert 'end' not in c

```
#cell_meta:show_steps=start,train
!python myflow.py run
```

<CodeOutputBlock lang="">

```
    ...
    [35m2022-02-15 14:01:14.810 [0m[32m[1644962474801237/start/1 (pid 46758)] [0m[1mTask is starting.[0m
    [35m2022-02-15 14:01:15.433 [0m[32m[1644962474801237/start/1 (pid 46758)] [0m[22mthis is the start[0m
    [35m2022-02-15 14:01:15.500 [0m[32m[1644962474801237/start/1 (pid 46758)] [0m[1mTask finished successfully.[0m
    ...
    [35m2022-02-15 14:01:15.507 [0m[32m[1644962474801237/train/2 (pid 46763)] [0m[1mTask is starting.[0m
    [35m2022-02-15 14:01:16.123 [0m[32m[1644962474801237/train/2 (pid 46763)] [0m[22mthe train step[0m
    [35m2022-02-15 14:01:16.188 [0m[32m[1644962474801237/train/2 (pid 46763)] [0m[1mTask finished successfully.[0m
    ...
```

</CodeOutputBlock>


```
#cell_meta:show_steps=train
!python myflow.py run
```

<CodeOutputBlock lang="">

```
    ...
    [35m2022-02-15 14:01:18.924 [0m[32m[1644962478210532/train/2 (pi

## Hide Specific Lines of Output With Keywords -

In [25]:
#export
class FilterOutput(Preprocessor):
    """
    Hide Output Based on Keywords.
    """
    def preprocess_cell(self, cell, resources, index):
        root = cell.metadata.get('nbdoc', {})
        words = root.get('filter_words', root.get('filter_word'))
        if 'outputs' in cell and words:
            _re = f"^(?!.*({'|'.join(words.split(','))}))"
            for o in cell.outputs:
                if o.name == 'stdout':
                    filtered_lines = [l for l in o['text'].splitlines() if re.findall(_re, l)]
                    o['text'] = '\n'.join(filtered_lines)
        return cell, resources

If we want to exclude output with certain keywords, we can use the `#meta:filter_words` comment.  For example, if we wanted to ignore all output that contains the text `FutureWarning` or `MultiIndex` we can use the comment:

`#meta:filter_words=FutureWarning,MultiIndex`
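
The filtering relies on a negative lookahead that rejects any line containing one of the keywords. A minimal sketch on a made-up string; `drop_lines` is a hypothetical helper, and it uses `re.search` where the class uses `re.findall`:

```python
import re

def drop_lines(text, words):
    """Drop every line that contains one of the comma-separated keywords (hypothetical helper)."""
    # "^(?!.*(a|b))" matches at the start of a line only if neither a nor b appears in it.
    pattern = "^(?!.*(" + "|".join(words.split(',')) + "))"
    return '\n'.join(l for l in text.splitlines() if re.search(pattern, l))

sample = ("FutureWarning: this API is deprecated\n"
          "  from pandas import MultiIndex, Int64Index\n"
          "Task is starting.")
kept = drop_lines(sample, "FutureWarning,MultiIndex")
print(kept)
```

As in the preprocessor, the keywords are placed into the pattern unescaped, so a keyword that contains regex metacharacters is interpreted as a regular expression.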

Consider this output below:

In [26]:
show_plain_md('test_files/strip_out.ipynb')

```python
#meta:show_steps=end
!python serialize_xgb_dmatrix.py run
```

      from pandas import MultiIndex, Int64Index
    [35m[1mMetaflow 2.5.3[0m[35m[22m executing [0m[31m[1mSerializeXGBDataFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:hamel[0m[35m[22m[K[0m[35m[22m[0m
    [35m[22mValidating your flow...[K[0m[35m[22m[0m
    [32m[1m    The graph looks good![K[0m[32m[1m[0m
    [35m[22mRunning pylint...[K[0m[35m[22m[0m
    [32m[1m    Pylint is happy![K[0m[32m[1m[0m
    [35m2022-03-30 07:04:02.315 [0m[1mWorkflow starting (run-id 1648649042312116):[0m
    [35m2022-03-30 07:04:02.322 [0m[32m[1648649042312116/start/1 (pid 2459)] [0m[1mTask is starting.[0m
    [35m2022-03-30 07:04:03.508 [0m[32m[1648649042312116/start/1 (pid 2459)] [0m[22mfrom pandas import MultiIndex, Int64Index[0m
    [35m2022-03-30 07:04:03.510 [0m[32m[1648649042312116/start/1 (pid 2459)] [0m[1mTask finished successfully.[0m
    [35m2022-03-30 07

Notice how the lines containing the terms `FutureWarning` or `MultiIndex` are stripped out:

In [27]:
c, _ = run_preprocessor([InjectMeta, FilterOutput], 
                        'test_files/strip_out.ipynb', 
                        display_results=True)
assert 'FutureWarning:' not in c and 'from pandas import MultiIndex, Int64Index' not in c

```python
#meta:show_steps=end
!python serialize_xgb_dmatrix.py run
```

<CodeOutputBlock lang="python">

```
    [35m[1mMetaflow 2.5.3[0m[35m[22m executing [0m[31m[1mSerializeXGBDataFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:hamel[0m[35m[22m[K[0m[35m[22m[0m
    [35m[22mValidating your flow...[K[0m[35m[22m[0m
    [32m[1m    The graph looks good![K[0m[32m[1m[0m
    [35m[22mRunning pylint...[K[0m[35m[22m[0m
    [32m[1m    Pylint is happy![K[0m[32m[1m[0m
    [35m2022-03-30 07:04:02.315 [0m[1mWorkflow starting (run-id 1648649042312116):[0m
    [35m2022-03-30 07:04:02.322 [0m[32m[1648649042312116/start/1 (pid 2459)] [0m[1mTask is starting.[0m
    [35m2022-03-30 07:04:03.510 [0m[32m[1648649042312116/start/1 (pid 2459)] [0m[1mTask finished successfully.[0m
    [35m2022-03-30 07:04:03.517 [0m[32m[1648649042312116/end/2 (pid 2462)] [0m[1mTask is starting.[0m
    [35m2022-03-30 07:04:04.563 [0m[32m[1648649042312116/

## Limit The Number Of Lines Of Output -

In [28]:
#export
class Limit(Preprocessor):
    """
    Limit the number of lines of output based on the `limit` metadata value.
    """
    def preprocess_cell(self, cell, resources, index):
        root = cell.metadata.get('nbdoc', {})
        n = root.get('limit')
        if 'outputs' in cell and n:
            for o in cell.outputs:
                if o.name == 'stdout':
                    o['text'] = '\n'.join(o['text'].splitlines()[:int(n)] + ['...'])
        return cell, resources
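
The truncation performed by `Limit` can be sketched on a plain string. `limit_lines` is a hypothetical helper that mirrors the expression used in the class:

```python
def limit_lines(text, n):
    """Keep the first n lines of text and append an ellipsis (hypothetical helper)."""
    # n arrives as a string because it is parsed from a comment such as #meta:limit=5.
    return '\n'.join(text.splitlines()[:int(n)] + ['...'])

out = limit_lines('\n'.join(['hello'] * 10), '5')
print(out)
```

One consequence of this expression is that the trailing `...` is appended even when the output has fewer lines than the limit.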

In [29]:
c, _ = run_preprocessor([InjectMeta, Limit], 
                        'test_files/limit.ipynb', 
                        display_results=True)

_res = """```
    hello
    hello
    hello
    hello
    hello
    ...
```"""
assert _res in c

```python
#meta:limit=6
!python serialize_xgb_dmatrix.py run
```

<CodeOutputBlock lang="python">

```
      from pandas import MultiIndex, Int64Index
    [35m[1mMetaflow 2.5.3[0m[35m[22m executing [0m[31m[1mSerializeXGBDataFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:hamel[0m[35m[22m[K[0m[35m[22m[0m
    [35m[22mValidating your flow...[K[0m[35m[22m[0m
    [32m[1m    The graph looks good![K[0m[32m[1m[0m
    [35m[22mRunning pylint...[K[0m[35m[22m[0m
    ...
```

</CodeOutputBlock>


```python
#meta:limit=5
print('\n'.join(['hello']*10))
```

<CodeOutputBlock lang="python">

```
    hello
    hello
    hello
    hello
    hello
    ...
```

</CodeOutputBlock>



## Hide Specific Lines of Code -

In [30]:
#export
class HideInputLines(Preprocessor):
    """
    Hide lines of code in code cells with the comment `#meta_hide_line` at the end of a line of code.
    """
    tok = '#meta_hide_line'
    
    def preprocess_cell(self, cell, resources, index):
        if cell.cell_type == 'code':
            if self.tok in cell.source:
                cell.source = '\n'.join([c for c in cell.source.split('\n') if not c.strip().endswith(self.tok)])
        return cell, resources

You can use the special comment `#meta_hide_line` to hide a specific line of code in a code cell.  This is what the code looks like before:

In [31]:
show_plain_md('test_files/hide_lines.ipynb')

```python
def show():
    a = 2
    b = 3 #meta_hide_line
```



and after:

In [32]:
c, _ = run_preprocessor([InjectMeta, HideInputLines], 
                        'test_files/hide_lines.ipynb', 
                        display_results=True)

```python
def show():
    a = 2
```



In [33]:
#hide 
_res = """```python
def show():
    a = 2
```"""
assert _res in c
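
The line filtering can also be reproduced on a plain string. This is a sketch of the same idea as `HideInputLines`; `hide_lines` is a hypothetical helper:

```python
TOKEN = '#meta_hide_line'

def hide_lines(source):
    """Drop every line that ends with the hide token (hypothetical helper)."""
    return '\n'.join(l for l in source.split('\n') if not l.strip().endswith(TOKEN))

code = "def show():\n    a = 2\n    b = 3 #meta_hide_line"
visible = hide_lines(code)
print(visible)
```

Only lines that end with the token are dropped, so a line that merely mentions `#meta_hide_line` in the middle is kept.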

## Handle Scripts With `%%writefile` -

In [34]:
#export
class WriteTitle(Preprocessor):
    """Modify the code-fence with the filename upon %%writefile cell magic."""
    pattern = r'(^[\S\s]*%%writefile\s)(\S+)\n'
    
    def preprocess_cell(self, cell, resources, index):
        m = re.match(self.pattern, cell.source)
        if m: 
            filename = m.group(2)
            ext = filename.split('.')[-1]
            cell.metadata.magics_language = f'{ext} title="{filename}"'
            cell.metadata.script = True
            cell.metadata.file_ext = ext
            cell.metadata.filename = filename
            cell.outputs = []
        return cell, resources

`WriteTitle` creates the proper code-fence with a title in the situation where the `%%writefile` magic is used.
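
How the code-fence information is derived from the magic can be sketched with the same regular expression. `fence_info` is a hypothetical helper that returns the string `WriteTitle` stores in `magics_language`:

```python
import re

# Same expression as WriteTitle.pattern: group 2 captures the filename.
PATTERN = r'(^[\S\s]*%%writefile\s)(\S+)\n'

def fence_info(source):
    """Return the code-fence info string for a %%writefile cell, else None (hypothetical helper)."""
    m = re.match(PATTERN, source)
    if not m:
        return None
    filename = m.group(2)
    ext = filename.split('.')[-1]  # text after the last dot, e.g. "py"
    return f'{ext} title="{filename}"'

info = fence_info("%%writefile myflow.py\nfrom metaflow import FlowSpec, step\n")
print(info)
```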

For example, here are contents before pre-processing:

In [35]:
show_plain_md('test_files/writefile.ipynb')

A test notebook


```python
%%writefile myflow.py
from metaflow import FlowSpec, step

class MyFlow(FlowSpec):
    
    @step
    def start(self):
        print('this is the start')
        self.next(self.train)
    
    @step
    def train(self):
        print('the train step')
        self.next(self.end)
    
    @step
    def end(self):
        print('this is the end')

if __name__ == '__main__':
    MyFlow()
```

    Overwriting myflow.py



```python
%%writefile hello.txt

Hello World
```

    Overwriting hello.txt




When we use `WriteTitle`, you will see the code-fence will change appropriately:

In [36]:
c, _ = run_preprocessor([WriteTitle], 'test_files/writefile.ipynb', display_results=True)
assert '```py title="myflow.py"' in c and '```txt title="hello.txt"' in c

A test notebook


```py title="myflow.py"
%%writefile myflow.py
from metaflow import FlowSpec, step

class MyFlow(FlowSpec):
    
    @step
    def start(self):
        print('this is the start')
        self.next(self.train)
    
    @step
    def train(self):
        print('the train step')
        self.next(self.end)
    
    @step
    def end(self):
        print('this is the end')

if __name__ == '__main__':
    MyFlow()
```


```txt title="hello.txt"
%%writefile hello.txt

Hello World
```



## Clean Flags and Magics -

In [37]:
#export
_tst_flags = get_config()['tst_flags'].split('|')

class CleanFlags(Preprocessor):
    """A preprocessor to remove Flags"""
    patterns = [re.compile(r'^#\s*{0}\s*'.format(f), re.MULTILINE) for f in _tst_flags]
    
    def preprocess_cell(self, cell, resources, index):
        if cell.cell_type == 'code':
            for p in self.patterns:
                cell.source = p.sub('', cell.source).strip()
        return cell, resources

In [38]:
c, _ = run_preprocessor([CleanFlags], _gen_nb())
assert '#notest' not in c

In [39]:
#export
class CleanMagics(Preprocessor):
    """A preprocessor to remove cell magic commands and #cell_meta: comments"""
    pattern = re.compile(r'(^\s*(%%|%).+?[\n\r])|({0})'.format(_re_meta), re.MULTILINE)
    
    def preprocess_cell(self, cell, resources, index):
        if cell.cell_type == 'code': 
            cell.source = self.pattern.sub('', cell.source).strip()
        return cell, resources

`CleanMagics` strips magic cell commands `%%` so they do not appear in rendered markdown files:

In [40]:
c, _ = run_preprocessor([WriteTitle, CleanMagics], 'test_files/writefile.ipynb', display_results=True)
assert '%%' not in c

A test notebook


```py title="myflow.py"
from metaflow import FlowSpec, step

class MyFlow(FlowSpec):
    
    @step
    def start(self):
        print('this is the start')
        self.next(self.train)
    
    @step
    def train(self):
        print('the train step')
        self.next(self.end)
    
    @step
    def end(self):
        print('this is the end')

if __name__ == '__main__':
    MyFlow()
```


```txt title="hello.txt"
Hello World
```



Here is how `CleanMagics` works on the file with the Metaflow log outputs from earlier. Notice that the `#cell_meta` comments are gone:

In [41]:
c, _ = run_preprocessor([InjectMeta, MetaflowSelectSteps, CleanMagics], 
                        'test_files/run_flow_showstep.ipynb', display_results=True)

```
!python myflow.py run
```

<CodeOutputBlock lang="">

```
    ...
    [35m2022-02-15 14:01:14.810 [0m[32m[1644962474801237/start/1 (pid 46758)] [0m[1mTask is starting.[0m
    [35m2022-02-15 14:01:15.433 [0m[32m[1644962474801237/start/1 (pid 46758)] [0m[22mthis is the start[0m
    [35m2022-02-15 14:01:15.500 [0m[32m[1644962474801237/start/1 (pid 46758)] [0m[1mTask finished successfully.[0m
    ...
    [35m2022-02-15 14:01:15.507 [0m[32m[1644962474801237/train/2 (pid 46763)] [0m[1mTask is starting.[0m
    [35m2022-02-15 14:01:16.123 [0m[32m[1644962474801237/train/2 (pid 46763)] [0m[22mthe train step[0m
    [35m2022-02-15 14:01:16.188 [0m[32m[1644962474801237/train/2 (pid 46763)] [0m[1mTask finished successfully.[0m
    ...
```

</CodeOutputBlock>


```
!python myflow.py run
```

<CodeOutputBlock lang="">

```
    ...
    [35m2022-02-15 14:01:18.924 [0m[32m[1644962478210532/train/2 (pid 46783)] [0m[1mTask is starting.[0m
    [35m2022-02-15 14

In [42]:
#hide
c, _ = run_preprocessor([WriteTitle, CleanMagics], 'test_files/hello_world.ipynb')
assert '#cell_meta' not in c

## Formatting Code With Black -

In [43]:
#export
black_mode = Mode()

class Black(Preprocessor):
    """Format code that has a cell tag `black`"""
    def preprocess_cell(self, cell, resources, index):
        tags = cell.metadata.get('tags', [])
        if cell.cell_type == 'code' and 'black' in tags:
            cell.source = format_str(src_contents=cell.source, mode=black_mode).strip()
        return cell, resources

`Black` is a preprocessor that will format cells that have the cell tag `black` with [Python black](https://github.com/psf/black) code formatting.  You can apply tags via the notebook interface or with a comment `#meta:tag=black`.

This is how cell formatting looks before [black](https://github.com/psf/black) formatting:

In [44]:
show_plain_md('test_files/black.ipynb')

Format with black


```python
#meta:tag=black
j = [1,
     2,
     3
]
```


```python
%%writefile black_test.py
#meta:tag=black


def very_important_function(template: str, *variables, file: os.PathLike, engine: str, header: bool = True, debug: bool = False):
    """Applies `variables` to the `template` and writes to `file`."""
    with open(file, 'w') as f:
        pass
```



After black is applied, the code looks like this:

In [45]:
c, _ = run_preprocessor([InjectMeta, UpdateTags, CleanMagics, Black], 'test_files/black.ipynb', display_results=True)
assert '[1, 2, 3]' in c
assert 'very_important_function(\n    template: str,' in c

Format with black


```python
j = [1, 2, 3]
```


```python
def very_important_function(
    template: str,
    *variables,
    file: os.PathLike,
    engine: str,
    header: bool = True,
    debug: bool = False
):
    """Applies `variables` to the `template` and writes to `file`."""
    with open(file, "w") as f:
        pass
```



## Show File Contents -

In [46]:
#export
class CatFiles(Preprocessor):
    """Cat arbitrary files with %cat"""
    pattern = r'^\s*!'
    
    def preprocess_cell(self, cell, resources, index):
        if cell.cell_type == 'code' and re.search(self.pattern, cell.source):
            cell.metadata.magics_language = 'bash'
            cell.source = re.sub(self.pattern, '', cell.source).strip()
        return cell, resources

## Format Shell Commands -

In [47]:
#export
class BashIdentify(Preprocessor):
    """A preprocessor to identify bash commands and mark them appropriately"""
    pattern = re.compile(r'^\s*!', flags=re.MULTILINE)
    
    def preprocess_cell(self, cell, resources, index):
        if cell.cell_type == 'code' and self.pattern.search(cell.source):
            cell.metadata.magics_language = 'bash'
            cell.source = self.pattern.sub('', cell.source).strip()
        return cell, resources

When we issue a shell command in a notebook with `!`, we need to change the code-fence from `python` to `bash` and remove the `!`:

In [48]:
c, _ = run_preprocessor([MetaflowTruncate, CleanMagics, BashIdentify], 'test_files/run_flow.ipynb', display_results=True)
assert "```bash" in c and '!python' not in c

```bash
python myflow.py run
```

<CodeOutputBlock lang="bash">

```
    [35m [0m[1mWorkflow starting (run-id 1647304124981100):[0m
    [35m [0m[32m[1647304124981100/start/1 (pid 41951)] [0m[1mTask is starting.[0m
    [35m [0m[32m[1647304124981100/start/1 (pid 41951)] [0m[22mthis is the start[0m
    [35m [0m[32m[1647304124981100/start/1 (pid 41951)] [0m[1mTask finished successfully.[0m
    [35m [0m[32m[1647304124981100/end/2 (pid 41954)] [0m[1mTask is starting.[0m
    [35m [0m[32m[1647304124981100/end/2 (pid 41954)] [0m[22mthis is the end[0m
    [35m [0m[32m[1647304124981100/end/2 (pid 41954)] [0m[1mTask finished successfully.[0m
    [35m [0m[1mDone![0m
    [0m
```

</CodeOutputBlock>
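
The detection and cleanup of the leading `!` can be sketched on plain strings. `to_bash` is a hypothetical helper; returning `'python'` for cells without a shell command is an assumption made for this illustration:

```python
import re

# Same expression as BashIdentify.pattern: a "!" at the start of any line.
BANG = re.compile(r'^\s*!', flags=re.MULTILINE)

def to_bash(source):
    """Return (language, cleaned source) for a code cell (hypothetical helper)."""
    if BANG.search(source):
        return 'bash', BANG.sub('', source).strip()
    return 'python', source  # assumed default for cells without "!"

lang, cleaned = to_bash("!python myflow.py run")
print(lang, cleaned)
```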



## Remove `ShowDoc` Input Cells -

In [49]:
#export
_re_showdoc = re.compile(r'^ShowDoc', re.MULTILINE)


def _isShowDoc(cell):
    "Return True if cell contains ShowDoc."
    if cell['cell_type'] == 'code':
        if _re_showdoc.search(cell.source): return True
    else: return False


class CleanShowDoc(Preprocessor):
    """Ensure that ShowDoc output gets cleaned in the associated notebook."""
    _re_html = re.compile(r'<HTMLRemove>.*</HTMLRemove>', re.DOTALL)
    
    def preprocess_cell(self, cell, resources, index):
        "Convert cell to a raw cell with just the stripped portion of the output."
        if _isShowDoc(cell):
            all_outs = [o['data'] for o in cell.outputs if 'data' in o]
            html_outs = [o['text/html'] for o in all_outs if 'text/html' in o]
            if len(html_outs) != 1:
                return cell, resources
            cleaned_html = self._re_html.sub('', html_outs[0])
            cell = AttrDict({'cell_type':'raw', 'id':cell.id, 'metadata':cell.metadata, 'source':cleaned_html})
                    
        return cell, resources

In [50]:
_result, _ = run_preprocessor([CleanShowDoc], 'test_files/doc.ipynb')
assert '<HTMLRemove>' not in _result
print(_result)

```python
from fastcore.all import test_eq
from nbdoc.showdoc import ShowDoc
```


<DocSection type="function" name="test_eq" module="fastcore.test" link="https://github.com/fastcore/tree/masterhttps://github.com/fastai/fastcore/tree/master/fastcore/test.py#L34">
<SigArgSection>
<SigArg name="a" /><SigArg name="b" />
</SigArgSection>
<Description summary="`test` that `a==b`" />

</DocSection>
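
To make the two regular expressions used by `CleanShowDoc` concrete, here is a small standalone sketch that relies only on the standard library. The sample cell source and the HTML string are made up for illustration; they are not taken from `test_files/doc.ipynb`, and real `ShowDoc` output may look different.

```python
import re

# Same patterns as in the exported cell above.
re_showdoc = re.compile(r'^ShowDoc', re.MULTILINE)
re_html = re.compile(r'<HTMLRemove>.*</HTMLRemove>', re.DOTALL)

# Hypothetical cell source: ShowDoc starts the second line, so re.MULTILINE
# is what lets `^` match there.
source = "from fastcore.all import test_eq\nShowDoc(test_eq)"
is_showdoc = bool(re_showdoc.search(source))

# A mention that does not start a line is not treated as a ShowDoc call.
not_showdoc = bool(re_showdoc.search("x = ShowDoc(test_eq)"))

# Hypothetical HTML output: the <HTMLRemove> span covers several lines, so
# re.DOTALL is needed for `.` to match the newlines inside it.
html = '<HTMLRemove>\n<h4>test_eq</h4>\n</HTMLRemove>\n<DocSection name="test_eq" />'
cleaned = re_html.sub('', html)

print(is_showdoc, not_showdoc)
print(repr(cleaned))
```

Because `.*` is greedy, a string containing several `<HTMLRemove>` spans would lose everything between the first opening tag and the last closing tag, including any content in between that was meant to be kept.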




## Composing Preprocessors Into A Pipeline

Let's see how to compose all of these preprocessors to process notebooks:

In [51]:
#export
def get_mdx_exporter(template_file='ob.tpl'):
    """An MDX notebook exporter that composes many preprocessors together."""
    c = Config()
    c.TagRemovePreprocessor.remove_cell_tags = ("remove_cell", "hide")
    c.TagRemovePreprocessor.remove_all_outputs_tags = ("remove_output", "remove_outputs", "hide_output", "hide_outputs")
    c.TagRemovePreprocessor.remove_input_tags = ('remove_input', 'remove_inputs', "hide_input", "hide_inputs")
    pp = [InjectMeta, WriteTitle, CleanMagics, BashIdentify, MetaflowTruncate,
          MetaflowSelectSteps, UpdateTags, InsertWarning, TagRemovePreprocessor, CleanFlags, CleanShowDoc, RmEmptyCode, 
          StripAnsi, Limit, HideInputLines, FilterOutput, Black, ImageSave, ImagePath, HTMLEscape]
    c.MarkdownExporter.preprocessors = pp
    tmp_dir = Path(__file__).parent/'templates/'
    tmp_file = tmp_dir/template_file
    if not tmp_file.exists(): raise ValueError(f"{tmp_file} does not exist in {tmp_dir}")
    c.MarkdownExporter.template_file = str(tmp_file)
    return MarkdownExporter(config=c)
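
Preprocessors run in the order they are listed in `pp`, and `UpdateTags` is listed before `TagRemovePreprocessor`. Assuming `UpdateTags` is the step that turns `#meta:tag=...` comments into cell tags, that ordering matters: the tags have to exist before anything can be removed based on them. The sketch below illustrates the idea with a simplified stand-in for a pipeline. Notebooks are plain dicts, and `update_tags`, `remove_tagged`, and `run_pipeline` are hypothetical helpers, not part of nbconvert or this module.

```python
# A simplified stand-in for a preprocessor pipeline; this is not nbconvert's API.
# Notebooks are plain dicts here, and each step is a function nb -> nb.

def update_tags(nb):
    "Hypothetical step: turn a `#meta:tag=<name>` comment into a cell tag."
    for cell in nb["cells"]:
        for line in cell["source"].splitlines():
            if line.startswith("#meta:tag="):
                cell["tags"].append(line.split("=", 1)[1].strip())
    return nb

def remove_tagged(nb, remove_tags=("remove_cell", "hide")):
    "Hypothetical step: drop cells carrying any of `remove_tags`."
    nb["cells"] = [c for c in nb["cells"] if not set(c["tags"]) & set(remove_tags)]
    return nb

def run_pipeline(nb, steps):
    "Apply each step in list order, feeding the result to the next step."
    for step in steps:
        nb = step(nb)
    return nb

def make_nb():
    "Build a fresh two-cell notebook; the first cell asks to be hidden."
    return {"cells": [
        {"source": "#meta:tag=hide\nsecret = 1", "tags": []},
        {"source": "a = 2", "tags": []},
    ]}

# Tagging first: the hidden cell is tagged, then removed.
tag_then_remove = run_pipeline(make_nb(), [update_tags, remove_tagged])

# Removing first: no tags exist yet, so nothing is removed.
remove_then_tag = run_pipeline(make_nb(), [remove_tagged, update_tags])

print(len(tag_then_remove["cells"]), len(remove_then_tag["cells"]))
```

With the steps swapped, the cell marked `hide` survives because it is only tagged after the removal step has already run.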

`get_mdx_exporter` combines all of the previous preprocessors with the built-in `TagRemovePreprocessor`, which hides cell inputs or outputs based on cell tags.  Here is an example of markdown generated from a notebook with the default preprocessing:

In [52]:
show_plain_md('test_files/example_input.ipynb')

---
title: my hello page title
description: my hello page description
hide_table_of_contents: true
---
## This is a test notebook

This is a shell command:


```python
! echo hello
```

    hello


We are writing a python script to disk:


```python
%%writefile myflow.py

from metaflow import FlowSpec, step

class MyFlow(FlowSpec):
    
    @step
    def start(self):
        print('this is the start')
        self.next(self.end)
    
    @step
    def end(self):
        print('this is the end')

if __name__ == '__main__':
    MyFlow()
```

    Overwriting myflow.py


Another shell command where we run a flow:


```python
#cell_meta:show_steps=start
! python myflow.py run
```

    [35m[1mMetaflow 2.5.3[0m[35m[22m executing [0m[31m[1mMyFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:hamel[0m[35m[22m[K[0m[35m[22m[0m
    [35m[22mValidating your flow...[K[0m[35m[22m[0m
    [32m[1m    The graph looks good![K[0m[32m[1m[0m
    [35m[22mRunning pylint...[K

Here is the same notebook processed with all of the preprocessors defined in this module.  Additionally, the built-in `TagRemovePreprocessor` hides the input of the last cell, which prints `hello, you should not see the print statement...`:

In [53]:
exp = get_mdx_exporter()
print(exp.from_filename('test_files/example_input.ipynb')[0])

---
title: my hello page title
description: my hello page description
hide_table_of_contents: true
---



## This is a test notebook

This is a shell command:


```bash
echo hello
```

<CodeOutputBlock lang="bash">

```
    hello
```

</CodeOutputBlock>

We are writing a python script to disk:


```py title="myflow.py"
from metaflow import FlowSpec, step

class MyFlow(FlowSpec):
    
    @step
    def start(self):
        print('this is the start')
        self.next(self.end)
    
    @step
    def end(self):
        print('this is the end')

if __name__ == '__main__':
    MyFlow()
```

Another shell command where we run a flow:


```bash
python myflow.py run
```

<CodeOutputBlock lang="bash">

```
    ...
     [1646981557065941/start/1 (pid 54733)] Task is starting.
     [1646981557065941/start/1 (pid 54733)] this is the start
     [1646981557065941/start/1 (pid 54733)] Task finished successfully.
    ...
```

</CodeOutputBlock>

This is a normal python cell:


```python
a = 2
a
```

