# Custom Markdown Parser

Takes Markdown as input and uses regular expressions to produce XML that follows a custom schema.

Next steps:
- Code batch processing (i.e. open all files in one directory, output to another).
- Automate transfer of processed files to a new directory to simplify the workflow.

In [1]:
# Notes on Version 2 (From General Hanley):

# Advocates transcending the "line-by-line approach":
# If you give re.sub the flags=re.MULTILINE option, then ^ and $ will match the
# beginning and end of each line, rather than only of the whole string. That way
# you can process the whole file at once. This should in principle be faster too.
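A minimal sketch of the whole-file approach described in these notes. The substitution rule here (prefixing each non-empty line with `<lb/>`) is only an illustration and is not taken from `parser_functions`; the sample text is made up.

```python
import re

# Made-up sample text: three non-empty lines and one blank line.
text = "first line\nsecond line\n\nthird line"

# With re.MULTILINE, ^ and $ anchor at every line, so a single call
# handles the whole string. The blank line is untouched because .+
# needs at least one character.
with_flag = re.sub(r"^(.+)$", r"<lb/>\1", text, flags=re.MULTILINE)

# Without the flag, ^ and $ anchor only at the ends of the whole string,
# and .+ cannot cross a newline, so nothing matches here.
without_flag = re.sub(r"^(.+)$", r"<lb/>\1", text)

print(with_flag)
```

Because the pattern is applied once to the full text, no explicit loop over lines is needed.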

In [2]:
import re, os, os.path, shutil

# custom functions:
import parser_functions as pf

### File Paths

A nice [explanation of using the os library](https://automatetheboringstuff.com/chapter8/).

In [3]:
# Input Folder
hdir = os.path.expanduser('~')
md_rel_path = "/Box/Notes/Primary_Sources/transcription_markdown_drafting_stage1"
md_path = hdir + md_rel_path

# Destination Folder
xml_rel_path = r"/Box/Notes/Primary_sources/xml_notes_stage2/parser_depository"
xml_path = hdir + xml_rel_path

# Folder to archive old Markdown files
archive_rel_path = "/Box/Notes/Primary_Sources/transcription_markdown_drafting_stage1/archive_docs_now_at_xml_stage_do_not_use"
archive_path = hdir + archive_rel_path

print ("Files currently in input folder ", os.path.dirname(md_path), ":")
os.listdir(md_path)


Files currently in input folder  /Users/kribblesworth/Box/Notes/Primary_Sources :


['archive_docs_now_at_xml_stage_do_not_use',
 'document_conversion_backlog',
 'ser902.txt']

In [4]:
# Minor: note that os.path.dirname returns the path of the parent folder, not the targeted one
print ("Files currently in destination folder ", os.path.dirname(xml_path), ":")

os.listdir(xml_path)

Files currently in destination folder  /Users/kribblesworth/Box/Notes/Primary_sources/xml_notes_stage2 :


['ser179.xml',
 'ser183.xml',
 'ser187.xml',
 'ser212.xml',
 'ser215.xml',
 'ser237.xml',
 'ser537.xml',
 'ser560.xml',
 'ser561.xml',
 'ser596.xml',
 'ser626.xml',
 'ser706.xml',
 'ser72.xml',
 'ser808.xml',
 'ser809.xml',
 'ser811.xml',
 'ser812.xml',
 'ser813.xml',
 'ser814.xml',
 'ser815.xml',
 'ser816.xml',
 'ser817.xml',
 'ser818.xml',
 'ser842.xml',
 'ser843.xml',
 'ser857.xml',
 'ser876.xml',
 'ser877.xml',
 'ser898.xml',
 'ser91.xml']
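The paths in this section are built by string concatenation, and `os.path.dirname` reports the parent folder. As a hedged alternative, the sketch below builds one of the same paths with `os.path.join` and shows how `dirname` and `basename` split it. The folder names are copied from the cells above; nothing here checks that the folders exist.

```python
import os

# Same input folder as above, built component by component.
home = os.path.expanduser("~")
md_path = os.path.join(home, "Box", "Notes", "Primary_Sources",
                       "transcription_markdown_drafting_stage1")

# dirname drops the last component, so it names the parent folder.
parent = os.path.dirname(md_path)

# basename returns only that last component.
folder_name = os.path.basename(md_path)

print("Target folder:", md_path)
print("Parent folder:", parent)
```

To label a listing with the folder actually being listed, print `md_path` itself rather than `os.path.dirname(md_path)`.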

## The Parser

In [8]:
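# Make sure we are in the input directory:
os.chdir(md_path)

# Test that everything works on the third entry of the listing
# (ser902.txt in the listing above; os.listdir order is not guaranteed):
print(pf.parse_md(os.listdir(md_path)[2]))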


<?xml-model href="../../../../../Projects/xml_development_eurasia/schemas/persian_documents_schema_basic.rnc" type="application/relax-ng-compact-syntax"?>
    <document serial = "902">
    	
	<div>
	^ Invocatio
		<lb/>باسم سلحانُ
	
	---
		<lb/>متصدیان مهمات حال و استقبال پرگنۀ <flag>دهار</flag> سرکار <flag>مند</flag> و صوبه مالوه بدانند که
		<lb/>درینولا <flag>حقیقت</flag> دولتخواهی و ابادنگاری و نیکوخدمتی <flag>پرسویم</flag> چودهری آنجا بعرض رسد که
		<lb/>در انجام کارهای سرکار خودرا معاف نمیدارد و مطابق سنه نوازش بآنمرحوم
		<lb/>موضع <flag>اهو</flag> و غیره ده قریه پانصد و پنجاه بیگه زمین و اشجار <flag>انبه</flag> و غیره بموضع مفصله ضمن
		<lb/>بر طبق نشان مرادبخش و جاگیرداران <flag>شیر</flag> در <flag>قصبه</flag> انعام و <flag>مالکار</flag> او
		<lb/>معه فرزندان مقرر است ما نیز <flag>مولضعات</flag> مذکور و اشجار مزبور و غیره را
		<lb/>در <flag>قصبه</flag> انعام او و فرزندانش مرحمت فرمودیم باید که قصبه مرقوم را
		<lb/>به <flag>نهج</flag> قدیم باو وا گذارند که محصول انرا سال بسال و فصل 

### Run parser on every file in the input directory, write XML to the output directory, archive the source

In [9]:
for filename in os.listdir(md_path):
    # Only Markdown drafts (.txt or .md); this also skips the archive subfolder
    if filename.endswith((".txt", ".md")):
        # Make sure we are in the input directory, since parse_md gets a bare filename
        os.chdir(md_path)

        # Serial number taken from the filename
        serial = pf.serial_no(filename)

        # Export filename
        output_file = "ser" + serial + ".xml"

        # Markdown parsed into XML text for output
        output_text = pf.parse_md(filename)

        # Write the XML to the destination folder (UTF-8, since the text is Persian)
        with open(os.path.join(xml_path, output_file), "w", encoding="utf-8") as fout:
            fout.write(output_text)

        # Archived filename
        archive_file = "archived_no" + serial + ".txt"

        # Move the processed file to the archive folder
        shutil.move(os.path.join(md_path, filename), os.path.join(archive_path, archive_file))
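The loop above can be tried safely before touching the real folders. The sketch below wraps the same steps in a function and runs it inside a temporary directory. Everything named here is an assumption made for illustration: `serial_no` is a hypothetical stand-in for `pf.serial_no` (assumed to return the digits in a filename such as `ser902.txt`), `convert_folder` is not part of `parser_functions`, and the lambda passed as `parse` is a placeholder for `pf.parse_md`.

```python
import os
import re
import shutil
import tempfile


def serial_no(filename):
    """Hypothetical stand-in for pf.serial_no: first run of digits in the name."""
    match = re.search(r"\d+", filename)
    if match is None:
        raise ValueError(f"no serial number in {filename!r}")
    return match.group(0)


def convert_folder(src, dest, archive, parse):
    """Parse each .txt/.md file in src, write XML to dest, move the source to archive."""
    written = []
    for filename in sorted(os.listdir(src)):
        if not filename.endswith((".txt", ".md")):
            continue
        serial = serial_no(filename)
        source_file = os.path.join(src, filename)
        output_name = "ser" + serial + ".xml"
        with open(os.path.join(dest, output_name), "w", encoding="utf-8") as fout:
            fout.write(parse(source_file))
        shutil.move(source_file, os.path.join(archive, "archived_no" + serial + ".txt"))
        written.append(output_name)
    return written


# Dry run in a throwaway directory with a placeholder parser.
with tempfile.TemporaryDirectory() as tmp:
    src, dest, archive = (os.path.join(tmp, name) for name in ("md", "xml", "archive"))
    for folder in (src, dest, archive):
        os.mkdir(folder)
    with open(os.path.join(src, "ser902.txt"), "w", encoding="utf-8") as f:
        f.write("sample text")
    result = convert_folder(src, dest, archive, parse=lambda path: "<document/>")
    remaining = os.listdir(src)
    archived = os.listdir(archive)

print(result, remaining, archived)
```

Passing the parser in as an argument keeps the file handling testable on its own; in the notebook, `pf.parse_md` would take the place of the lambda, assuming it accepts a full path as well as a bare filename.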