# Purpose

A Python script to prepend a hyperlinked table of contents to a generic Markdown file, based on that file's header hierarchy. 

The workflow:

- Add TOC to Markdown file
- Create HTML from Markdown via pandoc
- Add anchors and stylesheet to HTML document
- Push to GitHub (not yet implemented)


# Load libraries

In [1]:
from bs4 import BeautifulSoup
import subprocess

# Add TOC to Markdown file

## Load Markdown file

In [2]:
fname = input("Filename: ")
#fname = '/home/jtk/Site/REFS/Bash-basics.md'
#fname = '/home/jtk/Site/REFS/metadata.md'
#fname = '/home/jtk/Site/REFS/databases.md'
fhand = open(fname, 'r')

## Save lines that are either headers (wanted) or code sections (unwanted)

In [3]:
# extract headers
headers_comments = list()
for row in fhand:
    if row.startswith("```") or row.startswith('#'):
        headers_comments.append(row)

print(headers_comments)

['# Precursors to databases\n', '# Database approach\n', '## ANSI-SPARC architecture\n', '# (Generations of) data models\n', '## Object-based data models\n', '## Record-based data models\n', '### (1G) Hierarchical data model\n', '### (1G) Graph data model \n', '### (2G) Relational data model\n', '## Physical data models\n']


## Identify indices of code sections

In [4]:
comments_indices = list()

# gets indices of ```
for i in range(len(headers_comments)):
    if headers_comments[i].startswith("```"):
        comments_indices.append(i)
        
print(comments_indices)

[]


In [5]:
comment_ranges = list()
end = len(comments_indices)

# breaks ``` indices into pairs representing one code chunk
# (stopping at end-1 skips a trailing unmatched fence instead of raising IndexError)
for i in range(0, end - 1, 2):
    comment_ranges.append([comments_indices[i], comments_indices[i+1]])

print(comment_ranges)

[]


In [6]:
ignore_me = list()

# generates list of indexes to ignore
for i in comment_ranges:
    ignore_me.extend(list(range(i[0], i[1]+1, 1)))
                     
print(ignore_me)

[]


## Save only headers

In [7]:
headers = list()

# drops comments and code markers
for i, line in enumerate(headers_comments):
    if i not in ignore_me:
        headers.append(line)
        
for h in headers:
    print(h)

# Precursors to databases

# Database approach

## ANSI-SPARC architecture

# (Generations of) data models

## Object-based data models

## Record-based data models

### (1G) Hierarchical data model

### (1G) Graph data model 

### (2G) Relational data model

## Physical data models
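
The fence bookkeeping in cells 3 to 7 can also be done in a single pass by tracking whether the current line sits inside a fenced block. This is only a minimal sketch, not the notebook's own method: `extract_headers` is a hypothetical helper name, and it assumes every fence starts its line with three backticks.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed


def extract_headers(lines):
    """Return Markdown header lines, skipping anything inside fenced code."""
    headers = []
    in_code = False
    for line in lines:
        if line.startswith(FENCE):
            # each fence toggles between "inside code" and "outside code"
            in_code = not in_code
        elif line.startswith("#") and not in_code:
            headers.append(line)
    return headers


sample = [
    "# Title\n",
    FENCE + "bash\n",
    "# a shell comment, not a header\n",
    FENCE + "\n",
    "## Section\n",
]
print(extract_headers(sample))
```

An unmatched opening fence simply hides every `#` line after it, so no index pairing is needed.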



## Construct TOC from headers

In [8]:
TOC = list()

for h in headers:
    
    # split the line into the '#' marker and the title words
    hsplit = h.split(' ')
    
    # set the indentation
    hlevel = len(hsplit[0])
    if hlevel > 1:
        space = "\t"*(hlevel-1)
    else:
        space = ""
    
    # set the anchor name and link text
    # (strip() removes the newline and any trailing spaces, which would
    # otherwise leave a dangling "-" at the end of the anchor)
    lname = " ".join(hsplit[1:]).strip()
    aname = "-".join(lname.lower().split())
    
    # construct (indented) bullet point
    TOC.append(space+'- <a href="#'+aname+'">'+lname+'</a>\n')

In [9]:
for item in TOC:
    print(item)

- <a href="#precursors-to-databases">Precursors to databases</a>

- <a href="#database-approach">Database approach</a>

	- <a href="#ansi-sparc-architecture">ANSI-SPARC architecture</a>

- <a href="#(generations-of)-data-models">(Generations of) data models</a>

	- <a href="#object-based-data-models">Object-based data models</a>

	- <a href="#record-based-data-models">Record-based data models</a>

		- <a href="#(1g)-hierarchical-data-model">(1G) Hierarchical data model</a>

		- <a href="#(1g)-graph-data-model">(1G) Graph data model</a>

		- <a href="#(2g)-relational-data-model">(2G) Relational data model</a>

	- <a href="#physical-data-models">Physical data models</a>
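
A header with trailing whitespace, such as `### (1G) Graph data model `, can pick up a trailing hyphen in its anchor if only the newline is removed. The sketch below wraps the per-header logic in one function that trims whitespace first; `toc_entry` is a hypothetical name, and the anchor scheme (lowercase, spaces to hyphens) is assumed to match the one used in this notebook.

```python
def toc_entry(header):
    """Build one indented TOC bullet from a Markdown header line."""
    # separate the leading '#' marker from the title text
    marker, _, title = header.strip().partition(" ")
    title = title.strip()
    level = len(marker)            # number of '#' characters
    indent = "\t" * (level - 1)    # one tab per level below the top
    anchor = "-".join(title.lower().split())
    return f'{indent}- <a href="#{anchor}">{title}</a>\n'


print(toc_entry("### (1G) Graph data model \n"), end="")
```

Building the whole table is then `TOC = [toc_entry(h) for h in headers]`.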



## Prepend to file

In [10]:
#foname = "TOC_"+fname
foname = fname[:-3]+"1.md"

# write the TOC, a blank line, then the original file contents
with open(foname, "w") as fout, open(fname, 'r') as fhand:
    for row in TOC:
        fout.write(row)

    fout.write("\n")

    for row in fhand:
        fout.write(row)

# Create HTML from MD using pandoc

In [11]:
# https://stackoverflow.com/questions/26236126/how-to-run-bash-command-inside-python-script
html_out = foname[:-3]+'.html'
subprocess.run(['pandoc',foname, '-f', 'markdown', '-t', 'html', '-s', '-o', html_out])

CompletedProcess(args=['pandoc', '/home/jtk/Site/REFS/databases1.md', '-f', 'markdown', '-t', 'html', '-s', '-o', '/home/jtk/Site/REFS/databases1.html'], returncode=0)
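
Slicing off the last three characters to swap the extension only works for names ending in `.md`. A hedged alternative is to derive the output path with `pathlib` and let `subprocess.run` raise if pandoc fails. The helper names `pandoc_command` and `markdown_to_html` are hypothetical; the pandoc flags are the same ones used in the cell above.

```python
from pathlib import Path
import subprocess


def pandoc_command(md_path):
    """Return (command list, output path) for a Markdown -> standalone HTML run."""
    md_path = Path(md_path)
    html_path = md_path.with_suffix(".html")  # swap the extension, whatever it is
    cmd = ["pandoc", str(md_path), "-f", "markdown", "-t", "html",
           "-s", "-o", str(html_path)]
    return cmd, html_path


def markdown_to_html(md_path):
    """Run pandoc and return the path of the HTML file it wrote."""
    cmd, html_path = pandoc_command(md_path)
    # check=True raises CalledProcessError on a non-zero exit code
    subprocess.run(cmd, check=True)
    return html_path
```

Calling `markdown_to_html(foname)` would then stand in for the two lines in the cell above, assuming pandoc is on the PATH.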

# Add anchors and stylesheet to HTML

## Create BeautifulSoup object

In [12]:
fhand = open(html_out, 'r')
my_soup = BeautifulSoup(fhand, "html.parser")
#print(my_soup.prettify())

## Make every header an anchor

In [13]:
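headers = my_soup.find_all(["h1","h2","h3","h4","h5","h6"])
for h in headers:
    # h.string is None when the header contains nested tags (e.g. inline code); skip those
    if h.string is None:
        continue
    h.string.wrap(my_soup.new_tag("a"))
    del h['id']
    h.a['name'] = "-".join(h.get_text().lower().split(" "))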

In [14]:
#print(my_soup.prettify())

## Add CSS stylesheet to `<head>`

In [15]:
link = my_soup.new_tag("link")
link["rel"] = "stylesheet"
link["type"] = "text/css"
link['href'] = "refs.css"
my_soup.head.style.replace_with(link)

<style type="text/css">code{white-space: pre;}</style>

In [16]:
#print(my_soup.prettify())

# Write out

In [17]:
# a `with` block closes the file, which flushes the buffered text to disk;
# writing through a handle that is never closed is most likely why a
# second write seemed necessary
with open(html_out, 'w') as fhand:
    fhand.write(my_soup.prettify())