In [None]:
# default_exp convert

# Convert
> Jupyter Notebooks need to be converted to either Markdown or HTML documents to be compatible with the Medium API. Good tools for this already exist, such as Jupyter's `nbconvert`, so we are going to use just that.

In [None]:
# hide
from nbdev.showdoc import *

In [None]:
#export 
from nbconvert import MarkdownExporter, FilenameExtension
from nbconvert.writers import FilesWriter

## Jupyter nbconvert

Jupyter's `nbconvert` is a well-established tool created and maintained by the core Jupyter developers. There is no need to reinvent the wheel, hence we will be using `nbconvert`'s Python API to convert Jupyter Notebooks to Markdown documents. You can read more about `nbconvert`'s functionality in the [official documentation](https://nbconvert.readthedocs.io/en/latest/nbconvert_library.html).

We will be using an `Exporter`, namely the `MarkdownExporter`, which can read a Jupyter notebook and extract the main body (text) and resources (images, etc.). Let's first see the basics of how it works and then make a thin wrapper function around it.

In [None]:
m = MarkdownExporter()

In [None]:
body, resources = m.from_filename('../tests/test-notebook.ipynb')

All notebook exporters return a tuple containing the body and the resources of the document. For instance, the matplotlib image from our test notebook was stored as `output_4_0.png`.

In [None]:
resources['outputs']['output_4_0.png'];
resources['outputs'].keys()

dict_keys(['output_4_0.png', 'output_6_0.png', 'output_8_0.png', 'output_8_1.png'])

Also it is important to know that so far, the notebook markdown representation only exists as a python object and no files have been written.
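
To make that concrete, here is a minimal sketch of writing such a result to disk by hand. It assumes, as the keys above suggest, that `resources['outputs']` maps file names to raw bytes; `write_export` and the demo data are hypothetical stand-ins for a real export.

```python
import tempfile
from pathlib import Path

def write_export(body, outputs, directory, name):
    """Write the markdown body and every auxiliary file into `directory`."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    markdown_path = directory / f"{name}.md"
    markdown_path.write_text(body, encoding="utf-8")
    for filename, data in outputs.items():
        # assumption: each value holds the raw bytes of the file
        (directory / filename).write_bytes(data)
    return markdown_path

# stand-in data instead of a real (body, resources) pair
demo_body = "# Title\n\n![png](output_4_0.png)\n"
demo_outputs = {"output_4_0.png": b"\x89PNG\r\n\x1a\n"}

with tempfile.TemporaryDirectory() as tmp:
    written = write_export(demo_body, demo_outputs, tmp, "test-notebook")
    written_files = sorted(p.name for p in Path(tmp).iterdir())
```

In practice there is no need to do this manually, since `nbconvert` ships writers for exactly this purpose.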

In [None]:
def nb2md_draft(notebook:str):
    """
    Paper thin wrapper around nbconvert.MarkdownExporter. This function takes the path to a jupyter
    notebook and passes it to `MarkdownExporter().from_filename` which returns the body and resources
    of the document
    """
    m = MarkdownExporter()
    body, resources = m.from_filename(notebook)
    return body, resources

This is a very basic notebook-to-markdown converter. Further down in this module it is improved and extended with preprocessors, so this version of `nb2md()` is not the one that will be exported and is marked as a draft.

In [None]:
b, r = nb2md_draft('../tests/test-notebook.ipynb')

## Setting up a module logger

In [None]:
#export
import logging

def init_convert_logger(level = logging.INFO):
    # create logger
    logger = logging.getLogger('converter')
    logger.setLevel(level)

    # create console handler and set its level
    ch = logging.StreamHandler()
    ch.setLevel(level)

    # create formatter
    formatter = logging.Formatter('%(name)s:%(levelname)s - %(message)s')

    # add formatter to ch
    ch.setFormatter(formatter)

    # add ch to logger
    logger.addHandler(ch)

In [None]:
#export 
init_convert_logger(logging.DEBUG)

In [None]:
# 'application' code
logger = logging.getLogger('converter')
logger.debug('debug message')
logger.info('info message')
logger.warning('warn message')
logger.error('error message')
logger.critical('critical message')

converter:DEBUG - debug message
converter:INFO - info message
converter:WARNING - warn message
converter:ERROR - error message
converter:CRITICAL - critical message
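
One thing to keep in mind: every call to `init_convert_logger` attaches a new handler, so re-running the cell in a notebook would print each message more than once. Below is a minimal sketch of a guard against that. The `init_logger_once` helper and the `converter-demo` logger name are hypothetical, chosen so the sketch does not interfere with the module logger.

```python
import logging

def init_logger_once(name="converter-demo", level=logging.INFO):
    """Configure the named logger, attaching a handler only the first time."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setLevel(level)
        handler.setFormatter(logging.Formatter("%(name)s:%(levelname)s - %(message)s"))
        logger.addHandler(handler)
    return logger

first = init_logger_once()
second = init_logger_once()  # second call reuses the existing handler
```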


## Writing Notebook to file

We use the `FilesWriter` object to write the resulting markdown file to disk. We can specify the `build_directory` attribute ([see more Writer options](https://nbconvert.readthedocs.io/en/latest/config_options.html#writer-options)) to indicate where we would like to store our Notebook and the auxiliary files (images, etc.). The `FilesWriter` is "aggressive", meaning it will overwrite any existing files if there is a directory or filename clash. Lastly, it is also possible to write a custom Writer, such as a `MediumWriter` that renders the document and then uploads it to Medium, but because I am learning I'd rather see every step in the pipeline.

In [None]:
f = FilesWriter(build_directory = 'Rendered/')

Conveniently, the `write()` method of `FilesWriter` returns the output path.

In [None]:
f.write(output = body, 
        resources = resources,
        notebook_name = 'test-notebook')

'Rendered/test-notebook.md'
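
Because `FilesWriter` overwrites existing files, it can be worth checking for clashes before writing. The following is a small sketch and not part of `nbconvert`: `unused_build_directory` is a hypothetical helper that appends a numeric suffix until the directory name is free.

```python
import tempfile
from pathlib import Path

def unused_build_directory(base):
    """Return `base` if it does not exist yet, else `base-1`, `base-2`, ..."""
    base = Path(base)
    candidate = base
    suffix = 1
    while candidate.exists():
        candidate = base.with_name(f"{base.name}-{suffix}")
        suffix += 1
    return candidate

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Rendered").mkdir()  # simulate an earlier export
    free_name = unused_build_directory(Path(tmp) / "Rendered").name
    untouched_name = unused_build_directory(Path(tmp) / "Docs").name
```

The returned path could then be passed as `build_directory`.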

### Simple writing function

In [None]:
#export
def WriteMarkdown(body, resources, dir_path = None, filename = None):
    """
    body & resources are the output of any Jupyter nbconvert `Exporter`.
    dir_path should be a relative path with respect to the current working directory. 
    If dir_path is not passed, the output document and its auxiliary files will be written
    to the same location as the input jupyter notebook
    filename should be the output document's name
    
    This function logs the location of the newly written file
    """
    logger = logging.getLogger('converter')
    markdown_location = FilesWriter(build_directory = '' if dir_path is None else dir_path) \
    .write(
        output = body,
        resources = resources,
        notebook_name = filename
    )
    logger.info(f"Markdown document written to {markdown_location}")

#### Example 1 - Write to Jupyter's Notebook directory

In [None]:
WriteMarkdown(body, resources, filename = 'test-notebook')

converter:INFO - Markdown document written to ../tests/test-notebook.md


#### Example 2 - Write to new directory

In [None]:
WriteMarkdown(body, resources, dir_path= 'Docs', filename= 'test-notebook')

converter:INFO - Markdown document written to Docs/test-notebook.md


#### Example 3 - Write to directory with subdirectory

In [None]:
WriteMarkdown(body, resources, dir_path= 'Docs/Attempt1', filename= 'test-notebook')

converter:INFO - Markdown document written to Docs/Attempt1/test-notebook.md


#### Example 4 - Write outside the current working directory

In [None]:
WriteMarkdown(body, resources, dir_path= '../Docs', filename= 'test-notebook')

converter:INFO - Markdown document written to ../Docs/test-notebook.md


In [None]:
# hide
!rm -rf Docs/ Rendered/ ../Docs/ ../tests/test-notebook.md ../tests/output*.png

## Handling special tags

### Hide tags - Remove cell if cell has no output

We may wish certain markdown or code cells to not be present in the output document. To achieve this we can use `nbconvert`'s [`RegexRemovePreprocessor`](https://nbconvert.readthedocs.io/en/latest/removing_cells.html#removing-cells-using-regular-expressions-on-cell-content). Preprocessors such as this one can either be registered to an `Exporter` (see [how](https://nbconvert.readthedocs.io/en/latest/api/exporters.html#nbconvert.exporters.Exporter.register_preprocessor)) or passed as part of a config (see [how](https://nbconvert.readthedocs.io/en/latest/removing_cells.html#removing-pieces-of-cells-using-cell-tags)).

In [None]:
#export 
from nbconvert.preprocessors import RegexRemovePreprocessor

In [None]:
m = MarkdownExporter()
m.register_preprocessor(RegexRemovePreprocessor(patterns = [r'^#\s*hide-cell']), enabled = True);

__Funnily enough__, the `RegexRemovePreprocessor` [only hides cells that have the tag AND that do not produce an output](https://github.com/jupyter/nbconvert/issues/1091). For example:
```python
#hide-cell
a = 1
```
would be removed, but:
```python
#hide-cell
a = 1
print(a) # or simply a
```
would _not_ be removed.
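
The rule can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not `nbconvert`'s actual implementation; `is_removed` is a hypothetical helper.

```python
import re

HIDE_CELL = re.compile(r"^#\s*hide-cell")

def is_removed(source, outputs):
    """Simplified model: removed only if the pattern matches AND there are no outputs."""
    return bool(HIDE_CELL.match(source)) and not outputs

silent = is_removed("#hide-cell\na = 1", outputs=[])
noisy = is_removed(
    "#hide-cell\na = 1\nprint(a)",
    outputs=[{"output_type": "stream", "text": "1\n"}],
)
untagged = is_removed("a = 1", outputs=[])
```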

### Clear Output - Remove cell's output but keep cell's content

The standard preprocessors aren't really useful for what I want to do. The [`RegexRemovePreprocessor`](https://github.com/jupyter/nbconvert/blob/master/nbconvert/preprocessors/regexremove.py) only removes cells if they have no output in addition to matching the specified pattern(s). The [`ClearOutputPreprocessor`](https://github.com/jupyter/nbconvert/blob/master/nbconvert/preprocessors/clearoutput.py) removes all outputs from a notebook. Hence I am going to write a custom preprocessor that can hide a cell's source, a cell's output or the whole cell based on pattern matching performed on the cell's source. After some investigation I realised that the best way to achieve this is to use [cell tags](https://stackoverflow.com/a/48084050/12821043), though I do not like Jupyter's current tag environment: you have to add tags to a cell entirely through the GUI, navigating to the top menu bar, then the **View** section, then the **Cell Toolbar** sub-section and finally clicking on **Tags**, which adds a chunky extra section to all your cells, even those you do not want to tag. *Hence* I've gone for an implementation that allows both the use of tags and the use of text/regex-based tagging, in the custom preprocessor `HidePreprocessor` written below.

In [None]:
#export 
from nbconvert.preprocessors import Preprocessor, TagRemovePreprocessor
from traitlets import List, Unicode, Set
import re

class HidePreprocessor(Preprocessor):
    """
    Preprocessor that tags a cell for hiding based on regex matching of the cell's first line.
    Depending on `mode`, a matching cell receives the `hide-source`, `hide-output` or
    `hide-cell` tag, which a `TagRemovePreprocessor` can then act upon
    
    Regex matching is based on the [RegexRemovePreprocessor source]
    (https://github.com/jupyter/nbconvert/blob/master/nbconvert/preprocessors/regexremove.py)

    """
    
    def __init__(self, mode:str, **kwargs):
        # let the parent class set up its traitlets machinery
        super().__init__(**kwargs)
        if mode not in ('source', 'output', 'cell'): 
            raise ValueError(f"Mode {mode} not supported")
        self.mode = mode
        # raw string so that `\s` reaches the regex engine untouched
        self.pattern = rf'^#.*%\s*hide-{mode}'
        self.logger = logging.getLogger('converter')
    
    def preprocess_cell(self, cell, resources, cell_index):
        """
        Preprocessing to apply to each cell.
        """
        
        cell, resources = self.hide_(cell, resources, cell_index)
        
        return cell, resources
    
    def hide_(self, cell, resources, cell_index):
        
        mode = self.mode
        # only the first line of the cell is searched for the tag
        has_keyword = re.search(self.pattern, cell.source.split('\n')[0])
        if has_keyword:
            self.logger.debug(f"Found a hide-{mode} tag in cell #{cell_index}.")
            # keep any tags the cell already has
            cell.metadata.tags = list(cell.metadata.get('tags', [])) + [f'hide-{mode}']
            
        return cell, resources
        

`nbconvert` uses [`traitlets`](https://github.com/ipython/traitlets) where I would normally expect an `__init__()` method. Luckily it is quite intuitive to work with traitlets but I do not grasp the pros and cons of using it.

In [None]:
m = MarkdownExporter()
m.register_preprocessor(HidePreprocessor(mode = 'source'), enabled = True)
m.register_preprocessor(HidePreprocessor(mode = 'output'), enabled = True)
m.register_preprocessor(HidePreprocessor(mode = 'cell'), enabled = True)
m.register_preprocessor(
    TagRemovePreprocessor(
        remove_input_tags = ('hide-source',),
        remove_all_outputs_tags = ('hide-output',),
        remove_cell_tags = ('hide-cell',),
        enabled = True)
)

<nbconvert.preprocessors.tagremove.TagRemovePreprocessor at 0x7fa3aaf72c10>

The file `test-hiding.ipynb` contains 4 cells printing the string 'My name is Jack'. One has no tag. One has the `#%hide-source` tag, which results in only the output string being present in the Markdown document. Another has the `#%hide-output` tag, which results in only the cell source ("the code") being present in the Markdown document. The last cell has the `#%hide-cell` tag, which removes the whole cell (source and output) altogether.

In [None]:
b, r = m.from_filename('../tests/test-hiding.ipynb')
print(b)

converter:DEBUG - Found a hide-source tag in cell #3.
converter:DEBUG - Found a hide-output tag in cell #1.
converter:DEBUG - Found a hide-cell tag in cell #5.



### The output of the next cell is hidden


```python
#%hide-output
print('My name is Jack')
```

### The source of the next cell is hidden

    My name is Jack


### The entire next cell is hidden



The **above** explored how to hide cell sources, cell outputs and entire cells based on text-based tags. These will be added to the main `nb2md()` function at the end of this module.
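
As a compact recap, the text-based tagging boils down to matching the first line of a cell against one pattern per mode. The sketch below reuses the same regular expression as `HidePreprocessor`; `detect_hide_tag` is a hypothetical stand-alone helper, not part of the module.

```python
import re

MODES = ("source", "output", "cell")

def detect_hide_tag(cell_source):
    """Return the tag implied by the cell's first line, or None."""
    first_line = cell_source.split("\n")[0]
    for mode in MODES:
        if re.search(rf"^#.*%\s*hide-{mode}", first_line):
            return f"hide-{mode}"
    return None

tag_output = detect_hide_tag("#%hide-output\nprint('My name is Jack')")
tag_spaced = detect_hide_tag("# % hide-cell\nprint('My name is Jack')")
tag_late = detect_hide_tag("print('My name is Jack')\n#%hide-cell")  # not on line 1
```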

### Gister tags

I like syntax highlighting in Medium articles and this is only available (to my knowledge) via GitHub Gists. We will be making our own [preprocessor](https://nbconvert.readthedocs.io/en/latest/nbconvert_library.html#Using-different-preprocessors) that uploads the source code of cells starting with the special tag `#%gist`. Creating a POST request to submit a GitHub Gist is easy enough; here we have simply translated the [GitHub API](https://docs.github.com/en/rest/reference/gists#create-a-gist) into a Python request.

The only thing needed to submit GET/POST requests via the GitHub API is a GitHub token. As with the Medium token, we can declare it as an environment variable in our `~/.bashrc` or `~/.zshrc` files. The documentation to create a token can be found [on this page](https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token).

In [None]:
#export
import os
def check_gh_auth():
    logger = logging.getLogger('converter')
    if not os.getenv('GITHUB_TOKEN'):
        # adjacent string literals avoid embedding the indentation in the message
        logger.warning('Please declare your GITHUB_TOKEN as an environment variable, '
                       'read more here: https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token')
        return False
    else:
        logger.debug('GITHUB_TOKEN environment variable found.')
        return True

In [None]:
#export
from requests import post
import json

def upload_gist(gistname , gistcontent, description = "", public = False):
    """
    description: description of the gist, i.e. some metatext explaining what the gist is about
    gistname: name displayed for the gist, this impacts how the file is rendered based
    on the extension (e.g. script.py, README.md, script.R, query.sql...)
    gistcontent: this may be the name of a file or just a string describing a program
    public: whether the gist should be public or private (a bool or the strings 'true'/'false')
    
    Returns a tuple (ok, url), where url is None if the gist could not be created
    """
    if os.path.isfile(gistcontent):
        gistname = gistcontent if gistname is None else gistname
        with open(gistcontent, 'r') as f:
            gistcontent = f.read()
        
    post_req = post("https://api.github.com/gists",
                data = json.dumps({
                    'description': description, 
                    'files': {gistname: {'content': gistcontent}},
                    # honour the `public` argument, which may arrive as a string from a cell tag
                    'public': str(public).lower() == 'true'
                }),
                headers = {
                    'Authorization': f"token {os.getenv('GITHUB_TOKEN')}",
                    "Accept": "application/vnd.github.v3+json"
                }
        )

    # html_url is only present in the response when the gist was created
    return post_req.ok, post_req.json().get('html_url')

In [None]:
upload_gist('ghapitest', gistcontent = 'CONTRIBUTING.md')

(True, 'https://gist.github.com/82bc96c8ecb1f1d309dda67cecf5dab9')
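
To inspect what is sent without touching the network, the request can be assembled separately. This is a dry-run sketch: `build_gist_request` is a hypothetical helper and the token is a placeholder. Only the URL, headers and body fields already used in `upload_gist` appear here.

```python
import json

def build_gist_request(gistname, gistcontent, token, description="", public=False):
    """Assemble URL, headers and JSON body for a gist upload without sending it."""
    url = "https://api.github.com/gists"
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github.v3+json",
    }
    body = json.dumps({
        "description": description,
        "files": {gistname: {"content": gistcontent}},
        "public": bool(public),
    })
    return url, headers, body

url, headers, body = build_gist_request("script.py", "a = 1\n", token="not-a-real-token")
decoded = json.loads(body)
```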

I wish to have a gister that acts like a magic function without being one; instead it's just a set of instructions that are sent to the parser like so:
```python
#%gist description: My python program gistname: script.py public: false upload: source
a = 1
b = 2
c = a*b
```
where the `public`, `description` and `upload` flags are optional.

The `upload` flag exists to enable the user to upload the output of a command as a gist too (text file or HTML table/pandas dataframe). The user can specify what to upload with `upload: source`, `upload: output` or `upload: both`.
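
A minimal sketch of how such a tag line could be parsed into a dictionary, assuming the tag is written with the `%` marker (as in `#%gist`). `parse_gist_tag` is a hypothetical helper and takes a slightly different approach from the preprocessor defined later: it splits on `keyword:` pairs with a single regular expression.

```python
import re

KEYWORDS = ("description", "gistname", "public", "upload")
SPLITTER = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\s*:")

def parse_gist_tag(first_line):
    """Turn '#%gist key: value key: value' into a dict of stripped strings."""
    match = re.search(r"%\s*gist\s*(.*)", first_line)
    if match is None:
        return {}
    # re.split with one capture group yields [prefix, key, value, key, value, ...]
    parts = SPLITTER.split(match.group(1))
    return {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}

params = parse_gist_tag(
    "#%gist description: My python program gistname: script.py public: false upload: source"
)
```

Note that every value comes back as a string, so a flag such as `public` still needs converting to a boolean before use.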

### Handling tables

In [None]:
import bs4
import pandas as pd

# notebook with example dataframe
source = json.load(open("../tests/test-gist-output-df.ipynb"))['cells'][0]['outputs'][0]['data']['text/html'] 
soup = bs4.BeautifulSoup(''.join(source),'lxml')
table = soup.find_all('table')
df = pd.read_html(str(table), index_col = 0)[0]
df

Unnamed: 0,a,b,c
0,1,9,hair
1,2,0,potato
2,3,1,water


We may choose to remove the index column when exporting to csv by running `df.to_csv(index = False)`.

Now that we know how to upload a gist to GitHub and how to recover an HTML table as a pandas dataframe, we can incorporate these methods into our own `GisterPreprocessor`.
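
For comparison, the same table-to-CSV idea can be sketched with the standard library alone. This is an alternative to the `bs4`/`pandas` route used in this module, not a replacement for it: `TableRows` and `html_table_to_csv` are hypothetical names, and nested tables or `colspan` attributes are not handled.

```python
import csv
import io
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def html_table_to_csv(html):
    parser = TableRows()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()

demo_html = (
    "<table><tr><th></th><th>a</th><th>c</th></tr>"
    "<tr><th>0</th><td>1</td><td>hair</td></tr></table>"
)
csv_text = html_table_to_csv(demo_html)
```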

In [None]:
#export
import bs4
import pandas as pd

class GisterPreprocessor(Preprocessor):
    """
    Preprocessor that detects the presence of the #%gist tag in a Jupyter Notebook cell,
    uploads the code in that cell as a GitHub gist for the authenticated user and replaces the original cell
    with a link to the gist in the resulting markdown file
    """
    
    pattern = r'^#.*%\s*gist'
    is_auth = check_gh_auth()
    logger = logging.getLogger('converter')
    
    def get_params(self,cell, **kwargs):
        keywords = ['description', 'gistname', 'public', 'upload']
        params_string = re.search(r"%\s*gist\s*(.*)", cell.source.split('\n')[0])
        if params_string is None:
            raise Exception('Cell was labelled with a #gist tag but no parameters were passed')
        else:
            params_string = params_string.group(1)
        
        for keyword in keywords: 
            params_string = params_string.replace(keyword, f'\n{keyword}')
            
        params_string = params_string.split('\n')[1:]
        params_dict = {}
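        for param in params_string:
            # split on the first colon only, so that values may themselves contain colons
            key, sep, value = param.partition(':')
            if not sep:
                raise Exception(f"No value passed for gist parameter '{key.strip()}'")
            params_dict[key.strip()] = value.strip()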

        return params_dict 
    
    def upload_gist_from_cell(self, cell, cell_index, params, content, n_output = 0):
        """
        n_output is 0 by default (source) or positive otherwise, if positive
        it corresponds to the n_output-th cell's output element
        """       
        ok, gist_url = upload_gist(gistcontent = content, **params)
        if ok:
            self.logger.info(f"Gist {params['gistname']} succesfully uploaded!")
            # the source link (n_output == 0) always replaces the cell's content;
            # output links are appended, except for the first output in 'output' mode
            append = n_output > 0 and (self.upload == 'both' or n_output > 1)
            if not append:
                cell.source = f"[{gist_url}]({gist_url})"
            else:
                cell.source += f"\n[{gist_url}]({gist_url})"
        else:
            self.logger.error(f"Couldn't upload gist {params['gistname']}")
            
        return cell
    
    def handle_output(self, cell, cell_index, params):
        """
        Uploads each supported output of the cell (html tables, plain text and printed
        streams) as a separate gist. Returns the cell and the number of tables found
        """
    
        tbl_counter = 0
        # gist output handling
        # For each output we'll check the format and upload a separate gist
        for n_output, output in enumerate(cell.outputs):
            
            # nothing is uploaded unless one of the supported formats is found
            payload = None
            # pandas tables are rendered as html tables and 
            # hence can be detected via the <table> tag
            if 'data' in output.keys():
                if 'text/html' in output.data.keys():
                    html_src = ''.join(output.data['text/html'])
                    if re.search('<table.*', html_src) is not None:
                        tbl_counter += 1
                        self.logger.debug(f"Found table in cell {cell_index}, uploading...")
                        # (try) turn html table into dataframe
                        try:
                            soup = bs4.BeautifulSoup(html_src,'lxml')
                            table = soup.find_all('table')
                            payload = pd.read_html(str(table), index_col = 0)[0].to_csv(index = False)
                            params['gistname'] += ".csv"
                        except Exception:
                            self.logger.warning('Could not turn table to dataframe, uploading as html file')
                            payload = html_src
                            params['gistname'] += ".html"

                elif 'text/plain' in output.data.keys():
                    self.logger.debug(f"Uploading output from cell {cell_index} as text file...")
                    # output.data['text/plain'] may be a list of lines, so we join it into a string
                    # on its way out
                    payload = ''.join(output.data['text/plain'])
                    params['gistname'] += ".txt"

                
            elif 'output_type' in output.keys() and output.output_type == 'stream':
                self.logger.debug(f"Detected printed output in cell {cell_index}, uploading...")
                payload = ''.join(output.text)
                params['gistname'] += ".txt"
                
            if payload is None:
                self.logger.debug(f"Skipping unsupported output {n_output} in cell {cell_index}")
                continue
            cell = self.upload_gist_from_cell(cell, cell_index, params, payload, n_output + 1)
                        
        return cell, tbl_counter 
    
    def preprocess_cell(self, cell, resources, cell_index):
        """
        Preprocessing to apply to each cell.
        """        
        self.upload = 'source' # default
        
        has_keyword = re.search(self.pattern, cell.source.split('\n')[0])
        # gist handling
        if has_keyword:
            if self.is_auth:
                params = self.get_params(cell)
                self.logger.debug(f"Detected gist tag in cell {cell_index}  with arguments: {', '.join(params.keys())}; uploading...")
                
                if 'upload' in params.keys(): 
                    self.upload = params['upload']
                    del params['upload'] 
                
                # upload cell source
                payload = '\n'.join(cell.source.split('\n')[1:])
                cell = self.upload_gist_from_cell(cell, cell_index, params, content = payload)
                
                # upload output if user chose to
                if self.upload != 'source':
                    cell, tbl_counter = self.handle_output(cell, cell_index, params)
                
                # Finally change cell_type to markdown to get rid of outputs
                # and ensure good formatting of links
                cell.cell_type = 'markdown'
                
            else: self.logger.info("Gist not uploaded as GITHUB_TOKEN could not be found.")
        
           
            
        return cell, resources
        

converter:DEBUG - GITHUB_TOKEN environment variable found.


__Note:__ The output of the *gisted* cell will be removed, as the code cell is turned into a markdown cell.
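
The branching on output formats is easier to see in isolation. Below is a simplified sketch operating on plain dictionaries shaped like notebook outputs. `pick_payload` is a hypothetical helper, and unlike the preprocessor it keeps tables as HTML rather than converting them to CSV.

```python
def pick_payload(output):
    """Return (text, extension) for a notebook output dict, or None if unsupported."""
    data = output.get("data", {})
    if "text/html" in data and "<table" in "".join(data["text/html"]):
        return "".join(data["text/html"]), ".html"
    if "text/plain" in data:
        return "".join(data["text/plain"]), ".txt"
    if output.get("output_type") == "stream":
        return "".join(output["text"]), ".txt"
    return None

printed = pick_payload({"output_type": "stream", "name": "stdout", "text": ["My name is Jack\n"]})
table = pick_payload({
    "output_type": "execute_result",
    "data": {"text/html": ["<table><tr><td>1</td></tr></table>"], "text/plain": ["   0\n0  1"]},
})
failed = pick_payload({"output_type": "error", "ename": "NameError", "evalue": "x"})
```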

In [None]:
m = MarkdownExporter()
m.register_preprocessor(GisterPreprocessor(), enabled = True)


<__main__.GisterPreprocessor at 0x7fa398ded550>

In [None]:
b, r = m.from_filename('../tests/test-gist.ipynb')
print(b)

converter:DEBUG - Detected gist tag in cell 1  with arguments: description, gistname; uploading...
converter:INFO - Gist script.py succesfully uploaded!



## Uploading cells as gists            

[https://gist.github.com/6b14c3380ae500b70c15c5e013bb8ca5](https://gist.github.com/6b14c3380ae500b70c15c5e013bb8ca5)



In [None]:
WriteMarkdown(b,r, filename = 'test-gister')

converter:INFO - Markdown document written to ../tests/test-gister.md


In [None]:
b, r = m.from_filename('../tests/test-gist-output-df.ipynb')
print(b)

converter:DEBUG - Detected gist tag in cell 0  with arguments: description, gistname, upload; uploading...
converter:INFO - Gist pandas.py succesfully uploaded!
converter:DEBUG - Found table in cell 0, uploading...
converter:INFO - Gist pandas.py.csv succesfully uploaded!



[https://gist.github.com/f2901e759de6a82bb2c437fbd38d6ddf](https://gist.github.com/f2901e759de6a82bb2c437fbd38d6ddf)
[https://gist.github.com/d69379d0f9c7f9b4ecd72f0a5d469419](https://gist.github.com/d69379d0f9c7f9b4ecd72f0a5d469419)



In [None]:
WriteMarkdown(b,r, filename = 'test-gist-output-df')

converter:INFO - Markdown document written to ../tests/test-gist-output-df.md


In [None]:
b, r = m.from_filename('../tests/test-gist-output-print.ipynb')
print(b)

converter:DEBUG - Detected gist tag in cell 1  with arguments: description, gistname, upload; uploading...
converter:INFO - Gist script.py succesfully uploaded!
converter:DEBUG - Detected printed output in cell 1, uploading...
converter:INFO - Gist script.py.txt succesfully uploaded!



## Uploading cells as gists            

[https://gist.github.com/f12733e4e38ad47e86848c458a820304](https://gist.github.com/f12733e4e38ad47e86848c458a820304)
[https://gist.github.com/e85c02538d89429ee234ad24af05a413](https://gist.github.com/e85c02538d89429ee234ad24af05a413)



In [None]:
b, r = m.from_filename('../tests/test-gist-multi-mode-output.ipynb')
print(b)

converter:DEBUG - Detected gist tag in cell 1  with arguments: description, gistname, upload; uploading...
converter:INFO - Gist script.py succesfully uploaded!
converter:DEBUG - Detected printed output in cell 1, uploading...
converter:INFO - Gist script.py.txt succesfully uploaded!
converter:DEBUG - Found table in cell 1, uploading...
converter:INFO - Gist script.py.txt.csv succesfully uploaded!



## Uploading cells as gists            

[https://gist.github.com/59e730fbda8e234459bc298e3c68f5c8](https://gist.github.com/59e730fbda8e234459bc298e3c68f5c8)
[https://gist.github.com/5ddd5df86f505c2db8712e4e50816031](https://gist.github.com/5ddd5df86f505c2db8712e4e50816031)
[https://gist.github.com/63016f4c2d2d60aa1f15646bbb63aaba](https://gist.github.com/63016f4c2d2d60aa1f15646bbb63aaba)



In [None]:
# hide
!rm -rf ../tests/test-gister.md ../tests/test-notebook.md ../tests/test-dataframe.md

In [None]:
# hide
# remove gists
!for id in $(gh gist list --limit 40 | awk '/description: nb2medium-test/ { print $1}');do gh gist delete $id; done

### Image preprocessor

As we noticed before, when we use an `nbconvert` exporter on a Jupyter notebook it extracts the images from that notebook (e.g. plots) and stores them locally. We now need to take those images, upload them to Medium and replace each image with its URL. It is very similar to what we have done with code cells.

Notice in the cell below how the cells that contain an image have an entry such as `cell['outputs']...['data']['image/png']`. We can detect the presence of such an entry and upload the image to Medium via the Python Medium API we have written.

In [None]:
demonb = json.load(open('../tests/test-notebook.ipynb'))
print(
    demonb['cells'][4]['outputs'][0].keys(), '\n',
    demonb['cells'][4]['outputs'][0]['data'].keys(), '\n'
)

dict_keys(['data', 'metadata', 'output_type']) 
 dict_keys(['image/png', 'text/plain']) 



In [None]:
demonb['cells'][4]['outputs'][0]['data']['image/png'][:200]

'iVBORw0KGgoAAAANSUhEUgAAAYIAAAD4CAYAAADhNOGaAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAA5CklEQVR4nO3dd3hU55X48e8Z9QISQjBq'
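
Detecting which cells carry such an entry can be sketched on the raw notebook dictionary. `cells_with_png` and the small demo notebook are hypothetical and only mimic the structure printed above.

```python
def cells_with_png(notebook):
    """Return the indices of cells that have at least one image/png output."""
    found = []
    for index, cell in enumerate(notebook.get("cells", [])):
        for output in cell.get("outputs", []):
            if "image/png" in output.get("data", {}):
                found.append(index)
                break
    return found

demo_notebook = {"cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "outputs": [{"output_type": "stream", "text": ["hi\n"]}]},
    {"cell_type": "code", "outputs": [
        {"output_type": "display_data",
         "data": {"image/png": "iVBORw0KGgo=", "text/plain": ["<Figure>"]}},
    ]},
]}
png_cells = cells_with_png(demo_notebook)
```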

I am going to make a bold assumption. I am going to assume that if in a given cell the user outputs an image, the user doesn't want anything else to be outputted (e.g. text, or values). 

The representation of images in raw Jupyter Notebooks is not that of a valid image file: the image is represented as base64-encoded ASCII characters. We need to use `binascii.a2b_base64` to turn the ASCII characters into binary data, which results in a valid image that can then be uploaded to Medium. I figured this out by exploring how images are extracted in `nbconvert`'s [`ExtractOutputPreprocessor`](https://github.com/jupyter/nbconvert/blob/42cfece9ed07232c3c440ad0768b6a76f667fe47/nbconvert/preprocessors/extractoutput.py).
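
As a quick check of this, decoding just the first 12 characters of the string shown above already yields the 8-byte PNG file signature. `decode_png` is a hypothetical helper used only for this illustration.

```python
from binascii import a2b_base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def decode_png(b64_text):
    """Decode base64 text and verify that the result starts like a PNG file."""
    raw = a2b_base64(b64_text)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return raw

# first 12 characters of the notebook's image/png entry
prefix = decode_png("iVBORw0KGgoA")
```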

In [None]:
#export
from nb2medium.mediumapi import post_image
from binascii import a2b_base64
from random import randint #to generate new cell ids

class ImagePreprocessor(Preprocessor):
    """
    Preprocessor that detects the presence of the image in a Jupyter Notebook cell's output,
    uploads the image to Medium 
    """
    
    logger = logging.getLogger('converter')
    
    def preprocess(self, nb, resources):
        """
        Preprocessing to apply to each cell.
        Images can either be in a cell's output as a result of a plot being generated in the code (Scenario 1)
        Or they can be passed from a local file or the internet in a Markdown cell's source (Scenario 2)
        """
        
        n_items = len(nb['cells'])
        n = 0
        n_plots = n_local_images = 0
        while n < n_items:
            cell = nb['cells'][n]
            cell, newcell, img_count1 = self.upload_image_from_cell_output(cell, n)
            cell, img_count2 = self.upload_local_image_from_md(cell, resources, n)
            n_plots += img_count1
            n_local_images += img_count2
            nb['cells'][n] = cell
            if newcell is not None:
                n_items+=1
                #write next cell
                nb['cells'].insert(n+1, newcell)
                n+=1 # skip next cell
            n+=1
        
        self.logger.info(f"Detected {n_plots} plots and {n_local_images} local images in notebook.")
        
        return nb, resources
    
    
    def upload_local_image_from_md(self, cell, resources, cell_index):
        # Scenario 2
        # extract name and path of notebook being processed
        name = resources['metadata']['name']
        path = resources['metadata']['path']
        img_counter = 0
        
        # regex matches the way of inserting images in Markdown (e.g `![](somestring)`)
        # re.search finds image tags anywhere in the cell, not only at its start
        if cell.cell_type == 'markdown' and re.search(r'!\[\]\(.*?\)', cell.source):
            # figure out if path is local or online, if local upload
            # we use a non-greedy capture group in the regex to directly extract
            # the content of each image tag
            imgs = re.findall(r'!\[\]\((.*?)\)', cell.source)
            for img in imgs:
                img_path = os.path.join(path, img)
                if os.path.isfile(img_path): # local file
                    img_counter += 1
                    self.logger.debug(msg = f"Detected {img_counter} local image(s) in cell {cell_index}, uploading...")
                    upload = post_image(filename = img_path)
                    ok, upload = upload.ok, upload.json()
                    
                    if ok:
                        url = upload['data']['url']
                        self.logger.debug(msg = f"Image succesfully uploaded to {url}")
                        # plain string replacement, as the path may contain regex special characters
                        cell.source = cell.source.replace(img, url)
                    else:
                        self.logger.error(msg = "Could not upload image to Medium!")
                    
        return cell, img_counter
        
    def upload_image_from_cell_output(self, cell, cell_index):
        # Scenario 1
        newcell = None
        img_counter = 0
        if 'outputs' in cell.keys():
            # Iterate through each output of the cell, if at least 1 image is found 
            # clear all other content; for each img upload and place its url
            # in a new markdown cell (the original outputs are removed)
            # randint needs integer bounds, floats such as 1e4 are rejected by recent Python versions
            newcell = {'cell_type': 'markdown', 'id': str(randint(10_000, 100_000)),
                       'metadata':{}, 'source': ''}
            for output in cell.outputs:
                # matplotlib images are stored as base64-encoded PNG data
                # hence we are going with that; stream outputs have no `data` entry
                if 'data' in output.keys() and 'image/png' in output.data.keys():
                    img_counter += 1
                    
                    self.logger.debug(msg = f"Detected {img_counter} plot(s) in cell {cell_index}, uploading...")
                    img_bin = a2b_base64(output.data['image/png'])
                    
                    img = post_image(img = img_bin)
                    ok, img = img.ok, img.json()
                    if ok:
                        url = img['data']['url']
                        self.logger.debug(msg = f"Image succesfully uploaded to {url}")
                        # consecutive images are separated by a blank line
                        newcell['source'] = f"![]({url})\n" if not newcell['source'] \
                        else '\n'.join([newcell['source'], f"![]({url})\n"])
                    else:
                        self.logger.error(msg = "Could not upload image to Medium!")
                        
        
            if img_counter > 0: cell.outputs = []
            elif img_counter == 0: newcell = None
                
        return cell, newcell, img_counter
        

In [None]:
m = MarkdownExporter()
m.register_preprocessor(ImagePreprocessor(), enabled = True)

<__main__.ImagePreprocessor at 0x7fa398e48970>
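
The `preprocess` method above inserts new cells into the list it is walking, which is why it advances the index by hand. The pattern in isolation looks roughly like this. `insert_after_matches` is a hypothetical helper, with strings standing in for cells.

```python
def insert_after_matches(cells, needs_companion, make_companion):
    """Insert a companion item after every matching item, in place."""
    n = 0
    while n < len(cells):
        if needs_companion(cells[n]):
            cells.insert(n + 1, make_companion(cells[n]))
            n += 1  # skip the item that was just inserted
        n += 1
    return cells

result = insert_after_matches(
    ["text", "plot", "code"],
    needs_companion=lambda cell: cell == "plot",
    make_companion=lambda cell: "image-link",
)
# the extra increment prevents an endless loop even if the companion matches too
doubled = insert_after_matches(["plot"], lambda c: c == "plot", lambda c: "plot")
```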

As we would expect, nothing happens when we have a notebook containing online images in its markdown cells. Medium can access these directly from the internet.

In [None]:
b, r = m.from_filename('../tests/test-md-online-image.ipynb')

converter:INFO - Detected 0 plots and 0 local images in notebook.


But! If our preprocessor finds offline images, it will upload them to Medium

In [None]:
b, r = m.from_filename('../tests/test-md-offline-image.ipynb')

converter:DEBUG - Detected 1 local image(s) in cell 0, uploading...
converter:DEBUG - Image succesfully uploaded to https://cdn-images-1.medium.com/proxy/1*xYdnXpwz3wapR0XTS4aP6Q.png
converter:INFO - Detected 0 plots and 1 local images in notebook.
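
The distinction between online and local images can be sketched without any uploading. Note the assumptions: `split_images` is a hypothetical helper, it classifies targets by URL scheme rather than by checking the file system as the preprocessor does, and its pattern also accepts image tags with alt text.

```python
import re
from urllib.parse import urlparse

IMAGE_TAG = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")

def split_images(markdown_source):
    """Return (local, remote) image targets found in a markdown string."""
    local, remote = [], []
    for target in IMAGE_TAG.findall(markdown_source):
        is_remote = urlparse(target).scheme in ("http", "https")
        (remote if is_remote else local).append(target)
    return local, remote

local, remote = split_images("![](plots/a.png) and ![logo](https://example.com/b.png)")
```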


And finally, if it detects plots such as those coming out of matplotlib, the plot will be uploaded without ever writing the image to disk 😮. In future implementations, more image types can be handled, such as plotly interactive plots.

In [None]:
b, r = m.from_filename('../tests/test-matplotlib.ipynb')

converter:DEBUG - Detected 1 plot(s) in cell 2, uploading...
converter:DEBUG - Image succesfully uploaded to https://cdn-images-1.medium.com/proxy/1*Sr5Vt6FDDMwnWQgU8FV9Rw.png
converter:INFO - Detected 1 plots and 0 local images in notebook.


In [None]:
b, r = m.from_filename('../tests/test-multi-matplotlib.ipynb')

converter:DEBUG - Detected 1 plot(s) in cell 2, uploading...
converter:DEBUG - Image succesfully uploaded to https://cdn-images-1.medium.com/proxy/1*Sr5Vt6FDDMwnWQgU8FV9Rw.png
converter:DEBUG - Detected 2 plot(s) in cell 2, uploading...
converter:DEBUG - Image succesfully uploaded to https://cdn-images-1.medium.com/proxy/1*ptKG19-sFG7x90-eg5VkoQ.png
converter:INFO - Detected 2 plots and 0 local images in notebook.


## Wrapping it all together

Given all the work we have done, making the final function that wraps everything together is easy! We combine all the tools we have built in this module in the `uploader` module.

In [None]:
#hide
from nbdev.export import *
notebook2script()