In [None]:
# default_exp convert

# Convert
> The Jupyter notebooks need to be converted to either Markdown or HTML documents to be compatible with the Medium API. Good tools for this already exist, such as Jupyter's `nbconvert` or nbdev's `nbdev_nb2md`, so we are not going to reinvent the wheel.

In [None]:
# hide
from nbdev.showdoc import *

In [None]:
#export 
from nbconvert import MarkdownExporter, FilenameExtension
from nbconvert.writers import FilesWriter

## Jupyter nbconvert

Jupyter's nbconvert is a well-established tool created and maintained by the core Jupyter developers, so we will use `nbconvert`'s Python API to convert Jupyter notebooks to Markdown documents. You can read more about `nbconvert`'s functions in the [official documentation](https://nbconvert.readthedocs.io/en/latest/nbconvert_library.html).

We will be using an `Exporter`, namely the `MarkdownExporter`, which can read a Jupyter notebook and extract the main body (text) and resources (images, etc.). Let's first see the basics of how it works and then write a thin wrapper function around it.

In [None]:
m = MarkdownExporter()

In [None]:
body, resources = m.from_filename('../tests/test-notebook.ipynb')

All notebook exporters return a tuple containing the body and the resources of the document. For instance, the matplotlib image from our test notebook was stored as `output_4_1.png`:

In [None]:
resources['outputs']['output_4_1.png'];
resources['outputs'].keys()

dict_keys(['output_4_1.png', 'output_6_0.png', 'output_12_0.png', 'output_12_1.png'])

It is also important to know that, so far, the Markdown representation of the notebook only exists as a Python object; no files have been written.

In [None]:
def nb2md_draft(notebook:str):
    """
    Paper thin wrapper around nbconvert.MarkdownExporter. This function takes the path to a jupyter
    notebook and passes it to `MarkdownExporter().from_filename` which returns the body and resources
    of the document
    """
    m = MarkdownExporter()
    body, resources = m.from_filename(notebook)
    return body, resources

This is a very basic notebook-to-Markdown converter. Further down in this module it is extended with preprocessors that add features, so this is not the `nb2md()` function that will be exported, hence the `_draft` suffix.

In [None]:
b, r = nb2md_draft('../tests/test-notebook.ipynb')

## Writing Notebook to file

We use the `FilesWriter` object to write the resulting Markdown file to disk. We can set the `build_directory` attribute ([see more Writer options](https://nbconvert.readthedocs.io/en/latest/config_options.html#writer-options)) to indicate where we would like to store the converted document and its auxiliary files (images, etc.). The `FilesWriter` is "aggressive", meaning it will overwrite existing files if there is a directory or filename clash. Lastly, it is also possible to write a custom Writer, such as a `MediumWriter` that renders the document and then uploads it to Medium, but because I am learning I'd rather see every step in the pipeline.
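To make the writer's job concrete, here is a rough standard-library sketch of what has to happen to the `(body, resources)` pair: write the Markdown body, then dump each entry of `resources['outputs']` next to it. `write_markdown_sketch` is a hypothetical helper, not `FilesWriter`'s implementation, and it assumes the outputs dictionary maps file names to raw bytes, as the `output_4_1.png` entry above suggests.

```python
import tempfile
from pathlib import Path


def write_markdown_sketch(body, resources, build_directory, notebook_name):
    """Hypothetical stand-in for a writer: save the body and its auxiliary files."""
    out_dir = Path(build_directory)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Existing files with the same name are silently overwritten.
    md_path = out_dir / f"{notebook_name}.md"
    md_path.write_text(body, encoding="utf-8")

    # Assumption: resources['outputs'] maps a file name to the raw bytes of that file.
    for name, data in resources.get("outputs", {}).items():
        (out_dir / name).write_bytes(data)

    return str(md_path)


# Write a made-up document into a temporary directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = write_markdown_sketch(
        body="# Title\n\n![png](output_4_1.png)\n",
        resources={"outputs": {"output_4_1.png": b"\x89PNG fake bytes"}},
        build_directory=Path(tmp) / "Rendered",
        notebook_name="test-notebook",
    )
    written = sorted(p.name for p in Path(demo_path).parent.iterdir())
```

Seeing the two steps side by side is also a reminder that the Markdown body only refers to the images by file name, so the body and its auxiliary files have to stay in the same directory.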

In [None]:
f = FilesWriter(build_directory = 'Rendered/')

Conveniently, the `write()` method of `FilesWriter` returns the output path.

In [None]:
f.write(output = body, 
        resources = resources,
        notebook_name = 'test-notebook')

'Rendered/test-notebook.md'

### Simple writing function

In [None]:
#export
def WriteMarkdown(body, resources, dir_path = None, filename = None):
    """
    body & resources are the output of any Jupyter nbconvert `Exporter`.
    dir_path should be a relative path with respect to the current working directory. 
    If dir_path is not passed, the output document and its auxiliary files are written
    to the same location as the input Jupyter notebook.
    filename should be the output document's name (without extension).
    
    This function returns the location of the newly written file
    """
    return FilesWriter(build_directory = '' if dir_path is None else dir_path) \
    .write(
        output = body,
        resources = resources,
        notebook_name = filename
    )

#### Example 1 - Write to Jupyter's Notebook directory

In [None]:
WriteMarkdown(body, resources, filename = 'test-notebook')

'../tests/test-notebook.md'

#### Example 2 - Write to new directory

In [None]:
WriteMarkdown(body, resources, dir_path= 'Docs', filename= 'test-notebook')

'Docs/test-notebook.md'

#### Example 3 - Write to directory with subdirectory

In [None]:
WriteMarkdown(body, resources, dir_path= 'Docs/Attempt1', filename= 'test-notebook')

'Docs/Attempt1/test-notebook.md'

#### Example 4 - Write outside the current working directory

In [None]:
WriteMarkdown(body, resources, dir_path= '../Docs', filename= 'test-notebook')

'../Docs/test-notebook.md'

In [None]:
# hide
!rm -rf Docs/ Rendered/ ../Docs/ ../tests/test-notebook.md ../tests/output*.png

## Handling special tags

### Hide tags - Remove cell if cell has no output

We may not want certain Markdown or code cells to appear in the output document. To achieve this we can use `nbconvert`'s [`RegexRemovePreprocessor`](https://nbconvert.readthedocs.io/en/latest/removing_cells.html#removing-cells-using-regular-expressions-on-cell-content). Preprocessors such as this one can either be registered to an `Exporter` (see [how](https://nbconvert.readthedocs.io/en/latest/api/exporters.html#nbconvert.exporters.Exporter.register_preprocessor)) or passed as part of a config (see [how](https://nbconvert.readthedocs.io/en/latest/removing_cells.html#removing-pieces-of-cells-using-cell-tags)).

In [None]:
#export 
from nbconvert.preprocessors import RegexRemovePreprocessor

In [None]:
m = MarkdownExporter()
m.register_preprocessor(RegexRemovePreprocessor(patterns = [r'^#\s*hide-cell']), enabled = True);

__Funnily enough__, the `RegexRemovePreprocessor` [only hides cells that have the tag AND that do not produce an output](https://github.com/jupyter/nbconvert/issues/1091). For example:
```python 
#hide-cell
a = 1
```
would be removed, but:
```python 
#hide-cell
a = 1
print(a) # or simply a
```
would _not_ be removed.
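
The rule can be modelled in a few lines on plain dictionaries. This is a simplified sketch of the behaviour described above, not `nbconvert`'s code; `regex_remove_would_drop` and the dict-based cells are made up for illustration.

```python
import re


def regex_remove_would_drop(cell, patterns):
    """Simplified model of the behaviour described above, not nbconvert's code."""
    if not patterns:
        return False
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    has_outputs = bool(cell.get("outputs"))
    # A cell is dropped only when it matches a pattern AND produced no output.
    return bool(combined.match(cell["source"])) and not has_outputs


silent_cell = {"source": "#hide-cell\na = 1", "outputs": []}
noisy_cell = {
    "source": "#hide-cell\na = 1\nprint(a)",
    "outputs": [{"output_type": "stream", "text": "1\n"}],
}
patterns = [r"^#\s*hide-cell"]

silent_dropped = regex_remove_would_drop(silent_cell, patterns)
noisy_dropped = regex_remove_would_drop(noisy_cell, patterns)
```

Under this model only `silent_cell` is dropped, which is why a different mechanism is needed for cells that print or display something.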

### Clear Output - Remove cell's output but keep cell's content

The standard preprocessors do not quite do what I want. [`RegexRemovePreprocessor`](https://github.com/jupyter/nbconvert/blob/master/nbconvert/preprocessors/regexremove.py) only removes cells that match the specified pattern(s) and have no output, while [`ClearOutputPreprocessor`](https://github.com/jupyter/nbconvert/blob/master/nbconvert/preprocessors/clearoutput.py) removes all outputs from a notebook. Hence I am going to write a custom preprocessor that can hide a cell's source, a cell's output or the whole cell based on pattern matching performed on the cell's source. After some investigation I realised that the best way to achieve this was using [cell tags](https://stackoverflow.com/a/48084050/12821043), though I do not like Jupyter's current tag environment: tags can only be added through the GUI, by navigating to the top bar, then the **View** section, then the **Cell Toolbar** sub-section and finally clicking on **Tags**, which adds a chunky extra section to all your cells, even those you do not want to tag. *Hence* I've gone for an implementation that allows both the use of tags and the use of text/regex-based tagging in the custom preprocessor `HidePreprocessor` written below.

In [None]:
#export 
from nbconvert.preprocessors import Preprocessor, TagRemovePreprocessor
from traitlets import List, Unicode, Set
import re

class HidePreprocessor(Preprocessor):
    """
    Preprocessor that tags cells whose source matches a regex so that the cell's
    source, its output or the whole cell can later be removed by `TagRemovePreprocessor`

    Regex matching is based on the [RegexRemovePreprocessor source]
    (https://github.com/jupyter/nbconvert/blob/master/nbconvert/preprocessors/regexremove.py)

    """

    mode = Unicode().tag(config = True) # one of 'source', 'output', 'cell' (or 'all')
    patterns = List(Unicode(), default_value=[]).tag(config=True)
    remove_metadata_fields = Set(
        {'collapsed', 'scrolled'}
    ).tag(config=True)

    def check_conditions(self, cell):
        """
        Checks whether the start of a cell's source matches any of the patterns.
        Returns: a match object (truthy) if the cell should be tagged, else None.
        """

        # Compile all the patterns into one: each pattern is first wrapped
        # by a non-capturing group to ensure the correct order of precedence
        # and the patterns are joined with a logical or
        pattern = re.compile('|'.join('(?:%s)' % pattern
                             for pattern in self.patterns))

        # re.match only looks at the beginning of the source, so the marker
        # has to be on the first line of the cell
        return pattern.match(cell.source)
    
    def preprocess_cell(self, cell, resources, cell_index):
        """
        Preprocessing to apply to each cell.
        """
        # Skip preprocessing if the list of patterns is empty
        if not self.patterns:
            return cell, resources
        
        if self.mode == 'source': 
            cell, resources = self.hide_source(cell, resources)
        elif self.mode == 'output': 
            cell, resources = self.hide_output(cell, resources)
        elif self.mode == 'cell' or self.mode == 'all':
            cell, resources = self.hide_cell(cell, resources)
        
        return cell, resources
    
    def hide_source(self, cell, resources):
        
        if self.check_conditions(cell):
            cell.metadata.tags = ['hide-source']
            
        return cell, resources
        
    def hide_output(self, cell, resources):
        
        if cell.cell_type == 'code' and self.check_conditions(cell):
            cell.metadata.tags = ['hide-output']
                    
        return cell, resources
    
    def hide_cell(self, cell, resources):
        
        if self.check_conditions(cell):
            cell.metadata.tags = ['hide-cell']
            
        return cell, resources
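
The matching step in `check_conditions` can be tried on its own, without `nbconvert`. The sketch below joins several patterns into one regex and tags a dict-based cell; `combine_patterns` and `add_tag_if_match` are hypothetical helpers. Unlike the class above, which assigns a fresh one-element list, this variant appends to whatever tags the cell already carries.

```python
import re


def combine_patterns(patterns):
    """Wrap each pattern in a non-capturing group and join them with '|'."""
    return re.compile("|".join(f"(?:{p})" for p in patterns))


def add_tag_if_match(cell, patterns, tag):
    """Hypothetical helper: tag a dict-based cell whose source starts with a marker."""
    if patterns and combine_patterns(patterns).match(cell["source"]):
        tags = cell.setdefault("metadata", {}).setdefault("tags", [])
        if tag not in tags:
            tags.append(tag)  # keep any tags the user already set
    return cell


# The marker is on the first line, so the cell gets tagged.
marked = add_tag_if_match(
    {"source": "# hide-output\nprint('hi')", "metadata": {"tags": ["keep-me"]}},
    [r"^#\s*hide-output", r"^#\s*no-output"],
    "hide-output",
)

# The marker is a trailing comment, so re.match (anchored at the start) ignores it.
unmarked = add_tag_if_match(
    {"source": "print('hi')  # hide-output", "metadata": {}},
    [r"^#\s*hide-output"],
    "hide-output",
)
```

Appending rather than overwriting matters when a notebook already uses GUI tags for other purposes; whether that is worth the extra lines is a design choice.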
    

`nbconvert` uses [`traitlets`](https://github.com/ipython/traitlets) where I would normally expect an `__init__()` method. Luckily it is quite intuitive to work with traitlets but I do not grasp the pros and cons of using it.

In [None]:
m = MarkdownExporter()
m.register_preprocessor(HidePreprocessor(mode = 'source', patterns = [r'^#\s*hide-source']), enabled = True)
m.register_preprocessor(HidePreprocessor(mode = 'output', patterns = [r'^#\s*hide-output']), enabled = True)
m.register_preprocessor(HidePreprocessor(mode = 'cell', patterns = [r'^#\s*hide-cell']), enabled = True)
m.register_preprocessor(
    TagRemovePreprocessor(
        remove_input_tags = ('hide-source',),
        remove_all_outputs_tags = ('hide-output',),
        remove_cell_tags = ('hide-cell',),
        enabled = True)
)

<nbconvert.preprocessors.tagremove.TagRemovePreprocessor at 0x7fca56168520>

The file `test-hiding.ipynb` contains 4 cells printing the string 'My name is Jack'. The first one has no tags added. The second one has the `#hide-source` tag which results in only the output string being present in the Markdown document. The third cell has the `#hide-output` tag added to it which results in only the cell source ("the code") being present in the Markdown document. The last cell has the `#hide-cell` tag which removes the whole cell (source and output) altogether.

In [None]:
b, r = m.from_filename('../tests/test-hiding.ipynb')
print(b)


## Hiding cells              


```python
print('My name is Jack')
```

    My name is Jack


Same as above but hiding source (aka input), output and all (aka whole cell)

### The source of the next cell is hidden

    My name is Jack


### The output of the next cell is hidden


```python
#hide-output
print('My name is Jack')
```

### The entire next cell is hidden



**The above** explored how to hide cell sources, cell outputs and entire cells using text-based tags. These will be added to the main `nb2md()` function at the end of this module.

### Gister tags

I like syntax highlighting in Medium articles and this is only available (to my knowledge) via GitHub Gists. We will be making our own [preprocessor](https://nbconvert.readthedocs.io/en/latest/nbconvert_library.html#Using-different-preprocessors) that uploads the source code of cells starting with the special tag `# gist`. Creating a POST request to submit a GitHub Gist is easy enough; here we have simply translated the [GitHub API](https://docs.github.com/en/rest/reference/gists#create-a-gist) into a Python request.

The only thing needed to submit GET/POST requests via the GitHub API is a GitHub token. As with the Medium token, we can declare it as an environment variable in our `~/.bashrc` or `~/.zshrc` file. The documentation for creating a token can be found [on this page](https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token).
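
Before sending anything over the network, it can help to look at the pieces of the request in isolation. The following is a minimal sketch, assuming the token lives in `GITHUB_TOKEN`; `build_gist_request` is a hypothetical helper that only assembles the URL, headers and JSON payload, and the `env` parameter exists purely so the example can run with a dummy token.

```python
import json
import os


def build_gist_request(description, filename, content, public=False, env=None):
    """Hypothetical helper: assemble a 'create gist' request without sending it."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set")

    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github.v3+json",
    }
    payload = json.dumps({
        "description": description,
        "files": {filename: {"content": content}},
        "public": public,
    })
    return "https://api.github.com/gists", headers, payload


# A fake environment keeps the example independent of any real token.
url, headers, payload = build_gist_request(
    "demo", "script.py", "print('hello')", env={"GITHUB_TOKEN": "dummy-token"}
)
```

Keeping the request construction separate from the network call makes the token check and the payload shape easy to test offline.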

In [None]:
#export
import os
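def check_gh_auth():
    """Raise if the GITHUB_TOKEN environment variable is missing, otherwise return True."""
    if not os.getenv('GITHUB_TOKEN'):
        # Adjacent string literals avoid embedding the indentation in the message
        raise Exception('Please declare your GITHUB_TOKEN as an environment variable, '
                        'read more here: https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token')
    return True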

In [None]:
#export
from requests import post
import json

def upload_gist(description, gistcontent, gistname = None, public = False):
    """
    description: Description of gist, i.e. some metatext explaining what the gist is about
    gistname: name displayed for the gist, this impacts how the file is rendered based
    on the extension (e.g. script.py, README.md, script.R, query.sql...)
    gistcontent: this may be the name of a file or just a string containing a program
    public: whether the gist should be public or private; accepts a bool or the
    strings 'true'/'false' (as parsed from a `# gist` tag)
    """
    if os.path.isfile(gistcontent):
        gistname = gistcontent if gistname is None else gistname
        with open(gistcontent, 'r') as f:
            gistcontent = f.read()

    # normalise the flag so that both False and 'false' mean private
    public = str(public).strip().lower() == 'true'

    post_req = post("https://api.github.com/gists",
                data = json.dumps({
                    'description': description,
                    'files': {gistname: {'content': gistcontent}},
                    'public': public
                }),
                headers = {
                    'Authorization': f"token {os.getenv('GITHUB_TOKEN')}",
                    "Accept": "application/vnd.github.v3+json"
                }
        ).json() # to return dict response
    return post_req['id'], post_req['html_url']

In [None]:
upload_gist('ghapitest', gistcontent = 'CONTRIBUTING.md')

('3d8ca9a4dfc644d990bcc8d65cceb37d',
 'https://gist.github.com/3d8ca9a4dfc644d990bcc8d65cceb37d')

I wish to have a gister that acts like a magic function without being one; instead it is just a set of instructions that are passed to the parser like so:
```python
# gist description: My python program gistname: script.py public: false
a = 1
b = 2
c = a*b
```
where the `public` flag is optional and defaults to `False`.
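
One way to read such a tag line is sketched below. `parse_gist_tag` is a hypothetical stand-alone parser, not the method used by the preprocessor in this module; it assumes every value runs until the next `keyword:` or the end of the line, so a description may contain the word "public" as long as it is not followed by a colon.

```python
import re

KEYWORDS = ("description", "gistname", "public")


def parse_gist_tag(first_line):
    """Hypothetical parser for a '# gist key: value ...' line."""
    header = re.match(r"#\s*gist\b(.*)", first_line)
    if header is None:
        return None  # not a gist cell

    # Each value runs until the next 'keyword:' or the end of the line.
    keys = "|".join(KEYWORDS)
    pairs = re.findall(
        rf"\b({keys})\s*:\s*(.*?)(?=\s+(?:{keys})\s*:|$)", header.group(1)
    )
    params = {key: value.strip() for key, value in pairs}

    # 'public' is optional and defaults to False.
    params["public"] = params.get("public", "false").lower() == "true"
    return params


parsed = parse_gist_tag(
    "# gist description: My python program gistname: script.py public: false"
)
not_a_gist = parse_gist_tag("a = 1")
```

Converting `public` to a real boolean at parse time avoids passing the string `'false'` further down the pipeline.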

In [None]:
#export
class GisterPreprocessor(Preprocessor):
    """
    Preprocessor that detects the presence of the #gist tag in a Jupyter Notebook cell,
    uploads the code in that cell as a GitHub gist for the authenticated user and replaces the original cell
    for a link to the gist in the resulting markdown file
    """
    
    patterns = List(Unicode(), default_value=[]).tag(config=True)

    def check_conditions(self, cell):
        """
        Checks whether the first line of a cell matches any of the patterns.
        Returns: a match object (truthy) if the cell should be uploaded as a gist, else None.
        """

        # Compile all the patterns into one: each pattern is first wrapped
        # by a non-capturing group to ensure the correct order of precedence
        # and the patterns are joined with a logical or
        pattern = re.compile('|'.join('(?:%s)' % pattern
                             for pattern in self.patterns))

        # Only the first line of the cell is checked for the tag
        return pattern.match(cell.source.split('\n')[0])
    
    def get_params(self, cell, **kwargs):
        keywords = ['description', 'gistname', 'public']
        params_string = re.search(r"#\s*gist\s*(.*)", cell.source.split('\n')[0])
        # the capture group is empty when the tag carries no parameters
        if params_string is None or not params_string.group(1).strip():
            raise Exception('Cell was labelled with a #gist tag but no parameters were passed')
        params_string = params_string.group(1)

        # put each 'keyword:' on its own line so the string can be split into key/value pairs
        for keyword in keywords:
            params_string = params_string.replace(f'{keyword}:', f'\n{keyword}:')

        params_dict = {}
        for param in params_string.split('\n')[1:]:
            # split on the first colon only so that values may contain colons
            key, value = param.split(':', 1)
            params_dict[key.strip()] = value.strip()

        return params_dict
    
    def preprocess_cell(self, cell, resources, cell_index):
        """
        Preprocessing to apply to each cell.
        """
        # Skip preprocessing if the list of patterns is empty
        if not self.patterns:
            return cell, resources
        
        # gist handling
        if self.check_conditions(cell):
            params = self.get_params(cell)
            gist_id, gist_url = upload_gist(gistcontent = '\n'.join(cell.source.split('\n')[1:]), **params)
            cell.source = f"[{gist_url}]({gist_url})"
            cell.cell_type = 'markdown'
            
        return cell, resources
        

__Note:__ The output of the *gisted* cell will be removed, as the code cell is turned into a Markdown cell.

In [None]:
m.register_preprocessor(GisterPreprocessor(patterns = [r'^#\s*gist']), enabled = True)


<__main__.GisterPreprocessor at 0x7fca55ebc880>

In [None]:
b, r = m.from_filename('../tests/test-gister.ipynb')
print(b)


## Uploading cells as gists            

[https://gist.github.com/a90e764d5ed31da1de5ba5cdfe8c982f](https://gist.github.com/a90e764d5ed31da1de5ba5cdfe8c982f)

### same code no gist


```python
a = 1
b = 2
c = a*b**b
c
```




    4





In [None]:
WriteMarkdown(b,r, filename = 'test-gister')

'../tests/test-gister.md'

In [None]:
#hide
b, r = m.from_filename('../tests/test-notebook.ipynb')
WriteMarkdown(b, r, filename = 'test-notebook')

'../tests/test-notebook.md'

In [None]:
# hide
!rm -rf ../tests/test-gister.md ../tests/test-notebook.md

### Image preprocessor

As we have noticed before, when we use an `nbconvert` exporter on a Jupyter notebook it extracts the images from that notebook (e.g. plots) and stores them locally. We now need to take those images, upload them to Medium and replace each image with its URL. This is very similar to what we have done with code cells.

Notice in the cell below how the cells that contain an image have an entry such as `cell['outputs'][...]['data']['image/png']`. We can detect the presence of such an entry and upload the image to Medium via the Python Medium API we have written.

In [None]:
demonb = json.load(open('../tests/test-notebook.ipynb'))
print(
    demonb['cells'][4]['outputs'][0].keys(), '\n',
    demonb['cells'][4]['outputs'][0]['data'].keys(), '\n',
    demonb['cells'][4]['outputs'][1].keys(), '\n',
    demonb['cells'][4]['outputs'][1]['data'].keys()
)

dict_keys(['data', 'execution_count', 'metadata', 'output_type']) 
 dict_keys(['text/plain']) 
 dict_keys(['data', 'metadata', 'output_type']) 
 dict_keys(['image/png', 'text/plain'])


In [None]:
demonb['cells'][4]['outputs'][1]

{'data': {'image/png': 'iVBORw0KGgoAAAANSUhEUgAAAYIAAAD4CAYAAADhNOGaAAAAOXRFWHRTb2Z0d2FyZQ... (base64 string truncated)

I am going to make a bold assumption. I am going to assume that if in a given cell the user outputs an image, the user doesn't want anything else to be outputted (e.g. text, or values). 

Images in raw Jupyter notebooks are not stored as valid image files; they are represented as base64-encoded ASCII text. We need to use `binascii.a2b_base64` to turn the ASCII characters back into binary data, which results in a valid image that can then be uploaded to Medium. I figured this out by exploring how images are extracted in `nbconvert`'s [`ExtractOutputPreprocessor`](https://github.com/jupyter/nbconvert/blob/42cfece9ed07232c3c440ad0768b6a76f667fe47/nbconvert/preprocessors/extractoutput.py).
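
A quick way to convince ourselves is to decode just the first characters of the `image/png` string shown above and check that they yield the PNG file signature. This is a small sketch; `looks_like_png` is a hypothetical helper, and the `=` padding is added only so that the short prefix is a valid base64 chunk.

```python
from binascii import a2b_base64

# Every PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# The first characters of the 'image/png' string, padded to a valid base64 chunk.
encoded_prefix = "iVBORw0KGgo="
decoded_prefix = a2b_base64(encoded_prefix)


def looks_like_png(b64_text):
    """Hypothetical check: decode base64 text and look for the PNG signature."""
    return a2b_base64(b64_text).startswith(PNG_SIGNATURE)
```

Any `image/png` entry produced by matplotlib should start with the same `iVBORw0KGgo` characters, since they encode that signature.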

In [None]:
#export
from nb2medium.API import post_image
from binascii import a2b_base64

class ImagePreprocessor(Preprocessor):
    """
    Preprocessor that detects images in a Jupyter Notebook cell's output or in a
    Markdown cell's source, uploads them to Medium and replaces them with the returned URL
    """
    
    def preprocess_cell(self, cell, resources, cell_index):
        """
        Preprocessing to apply to each cell.
        Images can either be in a cell's output as a result of a plot being generated in the code (Scenario 1)
        Or they can be passed from a local file or the internet in a Markdown cell's source (Scenario 2)
        """
        cell, resources = self.upload_image_from_cell_output(cell, resources)
        cell, resources = self.upload_local_image_from_md(cell, resources)
        return cell, resources
        
    def upload_local_image_from_md(self, cell, resources):
        # Scenario 2
        # extract name and path of notebook being processed
        name = resources['metadata']['name']
        path = resources['metadata']['path']
        
        # regex matches the way of inserting images in Markdown (e.g `![](somestring)`)
        # re.search is used so that the image does not have to be at the start of the cell
        if cell.cell_type == 'markdown' and re.search(r'!\[\]\([^)]*\)', cell.source):
            # figure out if path is local or online, if local upload
            # we use a capture group in the regex to directly extract the content
            # of the image tag
            imgs = re.findall(r'!\[\]\(([^)]*)\)', cell.source)
            for img in imgs:
                img_path = os.path.join(path, img)
                if os.path.isfile(img_path): # local file
                    upload = post_image(filename = img_path).json()
                    url = upload['data']['url']
                    # plain string replacement: img may contain regex metacharacters
                    cell.source = cell.source.replace(img, url)
        return cell, resources
        
    def upload_image_from_cell_output(self, cell, resources):
        # Scenario 1
        if 'outputs' in cell.keys():
            # Iterate through each output of the cell, if at least 1 image is found
            # clear all other content; for each img upload and replace with url
            # change cell to markdown (output removed in this operation)
            img_counter = 0
            for output in cell.outputs:
                # matplotlib images are stored as base64-encoded PNG text
                # hence we are going with that; stream and error outputs
                # have no 'data' field, so fall back to an empty dict
                if 'image/png' in output.get('data', {}):
                    img_counter += 1
                    img_bin = a2b_base64(output.data['image/png'])
                    img = post_image(img = img_bin).json()
                    url = img['data']['url']
                    cell.source = f"![]({url})\n" if img_counter == 1 else '\n'.join([cell.source, f"![]({url})\n"])

            if img_counter > 0: cell.cell_type = 'markdown'
                

        return cell, resources
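
The Markdown side of the problem (Scenario 2) can be explored without touching Medium at all. In the sketch below, `replace_image_targets` is a hypothetical helper and the `uploaded` dictionary stands in for the URLs that `post_image` would return; the `example.com` addresses are placeholders. The pattern also accepts alt text and several images in one cell, which goes slightly beyond the `![](...)` form handled above.

```python
import re

# Capture the target of every Markdown image, allowing alt text and several images per line.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def replace_image_targets(source, uploaded):
    """Hypothetical helper: swap image targets for the URLs listed in `uploaded`."""
    for target in IMAGE_PATTERN.findall(source):
        if target in uploaded:
            # Replace the '](target)' span literally, so regex metacharacters
            # in file names (dots, plus signs, ...) cannot cause surprises.
            source = source.replace(f"]({target})", f"]({uploaded[target]})")
    return source


markdown = "Intro ![](plots/fig+1.png) and ![logo](https://example.com/logo.png)"
targets = IMAGE_PATTERN.findall(markdown)
rewritten = replace_image_targets(
    markdown, {"plots/fig+1.png": "https://cdn.example.com/abc.png"}
)
```

Only the local file is rewritten; the image that already points to a URL is left untouched.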
        

In [None]:
m = MarkdownExporter()
m.register_preprocessor(ImagePreprocessor(), enabled = True)

<__main__.ImagePreprocessor at 0x7fca5483fa30>

In [None]:
b, r = m.from_filename('../tests/test-notebook.ipynb')

In [None]:
WriteMarkdown(b,r, filename='test-notebook')

'../tests/test-notebook.md'

In [None]:
#hide
#clean up
!rm -rf ../tests/output*png

## Wrapping it all together

Given all the work we have done, writing the final function that wraps everything together is easy! We combine all the tools we have built in this module in the `uploader` module.