# Custom Markdown Parser

Takes Markdown as input and uses regular expressions to output XML that follows a custom schema.

Next steps:
- Add batch processing (i.e. open all files in one directory and write the output to another).
- Automate the transfer of processed files to a new directory to simplify the workflow.
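
The batch-processing step could be sketched roughly as follows. This is only a minimal sketch: `convert_directory` and its `convert` argument are hypothetical names (not part of the notebook), and it assumes the input files are UTF-8 and that the conversion can be expressed as a function from a string to a string.

```python
import os

def convert_directory(input_dir, output_dir, convert):
    """Apply `convert` (a str -> str function) to every .md file in input_dir.

    Hypothetical helper: writes one .xml file per .md file into output_dir
    and returns the list of paths written.
    """
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        # Skip subdirectories and non-markdown files such as .DS_Store.
        if not os.path.isfile(src) or not name.endswith('.md'):
            continue
        with open(src, encoding='utf-8') as fin:
            text = fin.read()
        dst = os.path.join(output_dir, name[:-3] + '.xml')
        with open(dst, 'w', encoding='utf-8') as fout:
            fout.write(convert(text))
        written.append(dst)
    return written
```

Passing the conversion in as a function keeps the directory walking separate from the regex logic, so either part can change without touching the other.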

In [196]:
# Notes on Version 2 (From General Hanley):

# Advocates transcending the "line-by-line approach":
# If you give re.sub the flags=re.MULTILINE option, then ^ and $ will match the
# beginning and end of each line, rather than only of the whole string. That way
# you can do it for the whole file at once. This should in principle be faster too.
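
A small illustration of the point about `re.MULTILINE`, using a made-up three-line `sample` string (an assumption for demonstration, not text from the corpus):

```python
import re

sample = "- first line\n- second line\nplain line"

# Without re.MULTILINE, ^ only matches at the very start of the string,
# so only the first bullet is converted.
single = re.sub(r'^- ', '<lb/>', sample)

# With re.MULTILINE, ^ matches at the start of every line,
# so every bullet in the string is converted in one call.
multi = re.sub(r'^- ', '<lb/>', sample, flags=re.MULTILINE)

print(single)
print(multi)
```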

In [197]:
import os
import re

### File Paths

In [198]:
input_path = r"/Users/Enkidu/Box Sync/Notes/Primary Sources/transcription_markdown_drafting_stage1/"

os.listdir(input_path)

['.DS_Store',
 '.Ulysses-Group.plist',
 'archive_docs_now_at_xml_stage_do_not_use',
 'document_conversion_backlog',
 'i126-1-1951_ser626.md']

In [199]:
output_path = r"/Users/Enkidu/Box Sync/Notes/Primary Sources/xml_notes_stage2/parser_depository/"
os.listdir(output_path)

[]

In [200]:
input_file = 'ser626.md'
# Skip any leading non-digits, then capture the digits that precede ".md".
ret = re.match(r'[^0-9]*([0-9]+)\.md', input_file)
doc_serial, = ret.groups()
print(doc_serial)

doc_serial_xml = '<document serial = "' + doc_serial + '">'
print(doc_serial_xml)

626
<document serial = "626">
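
The pattern above works for `ser626.md`, but `re.match` returns `None` for the full filename shown in the directory listing (`i126-1-1951_ser626.md`), because the first run of digits it finds is not followed by `.md`. One possible variant, sketched here with a hypothetical helper named `extract_serial`, anchors on the extension instead and assumes the serial is always the run of digits directly before `.md`:

```python
import re

def extract_serial(filename):
    """Return the digits that directly precede '.md', or None if there are none."""
    # re.search scans the whole name; $ pins the match to the end of the string.
    ret = re.search(r'([0-9]+)\.md$', filename)
    return ret.group(1) if ret else None

print(extract_serial('ser626.md'))
print(extract_serial('i126-1-1951_ser626.md'))
print(extract_serial('notes.md'))
```

Returning `None` rather than raising lets a batch loop skip files without a serial.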


In [201]:
# Read as UTF-8 explicitly, since the transcriptions contain non-ASCII text.
with open(input_file, encoding='utf-8') as fin:
    text = fin.read()

Previous code, from when `<lb>` wrapped each line instead of acting as a milestone:

```python

conv = re.sub(r'^ *\- (.*)$', r'<lb>\1</lb>', conv, flags=re.MULTILINE)

```
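
To make the difference concrete, here is a short comparison of the two approaches on a made-up two-line `sample` (an assumption for illustration only):

```python
import re

sample = "- first line\n- second line"

# Earlier approach: wrap the content of each bullet line in <lb>...</lb>.
wrapped = re.sub(r'^ *\- (.*)$', r'<lb>\1</lb>', sample, flags=re.MULTILINE)

# Current approach: replace only the bullet marker with an empty <lb/> milestone
# (preceded by a tab), leaving the rest of the line untouched.
milestone = re.sub(r'^ *\- ', r'\t<lb/>', sample, flags=re.MULTILINE)

print(wrapped)
print(milestone)
```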

In [202]:
# General Hanley on f-strings: I made a few other changes too; the major one is using f-strings,
# which are a great new feature. (You need Python version >= 3.6.)


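conv = text
# Turn each markdown bullet ("- ") into a tab plus an <lb/> line-break milestone.
conv = re.sub(r'^ *\- ', r'\t<lb/>', conv, flags=re.MULTILINE)
# A "----" rule closes the current <div> and opens the next one.
conv = re.sub(r'\----', r'</div>\n<div>', conv, flags=re.MULTILINE)
# Wrap *starred* spans in <flag>, then drop the asterisks themselves.
conv = re.sub(r'(\*[^\*\n]+\*)', r'<flag>\1</flag>', conv)
conv = re.sub(r'\*', r'', conv)
# Convert "> " blockquote lines into XML comments.
conv = re.sub(r'^> (.*)$', r'<!-- \1 -->', conv, flags=re.MULTILINE)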
# print(conv)
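
As a worked example, the same substitutions can be wrapped in a function and run on a tiny input. The function name `convert_markdown` and the `sample` text are assumptions made for this sketch; the regular expressions are the ones used in the cell above.

```python
import re

def convert_markdown(text):
    """Apply the notebook's substitutions, in the same order, to a markdown string."""
    conv = re.sub(r'^ *\- ', r'\t<lb/>', text, flags=re.MULTILINE)
    conv = re.sub(r'\----', r'</div>\n<div>', conv, flags=re.MULTILINE)
    # The flag wrapper must run before the asterisks are stripped.
    conv = re.sub(r'(\*[^\*\n]+\*)', r'<flag>\1</flag>', conv)
    conv = re.sub(r'\*', r'', conv)
    conv = re.sub(r'^> (.*)$', r'<!-- \1 -->', conv, flags=re.MULTILINE)
    return conv

sample = "- text *word* more\n> a comment\n----\n- next"
print(convert_markdown(sample))
```

The order matters: if the asterisks were removed first, the `<flag>` pattern would have nothing left to match.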

In [203]:
body = '\n'.join([f'\t{s}' for s in conv.split('\n')])
final = f"""
<?xml-model href="../../../../../Documents/digital_humanities/xml_development/schemas/schema_coding_primary_texts_2.0.rnc" type="application/relax-ng-compact-syntax"?>
<document>
{body}
</document>
""".strip()

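# Each "----" emitted "</div>\n<div>", so drop the first, unmatched </div>.
final = re.sub(r'</div>', '', final, count=1)
# Swap the placeholder root tag for the one carrying the serial number.
final = re.sub(r'<document>', doc_serial_xml, final)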

print(final)



<?xml-model href="../../../../Documents/digital_humanities/digital_eurasia/schemas/schema_coding_primary_texts_2.0.rnc" type="application/relax-ng-compact-syntax"?>
<document serial = "626">
	
	
	<div> 
	
	<!-- (First glance) Statement that ulama are in agreement that the vocal Zikr of Ahmad Yasawi is legitimate, based on a gathering of them. Seal marks this from 1284/1866. -->
	
		<lb/>درینمسئله که جماعۀ از مسلمین که ممتمسکین بسلسلۀ شریفه حضرت سلطان العافین قدوة المسلاکین برهان الشریفه و الحق و الدین قطب
		<lb/>الاولیاء و صفوة الاصفیاء <flag>اعنی</flag> حضرت خواجه احمد قدس الله تعالی روحه العزیز بطهارت  <flag>بادم</flag> تمام در مساجد و زوایای در مکان طاهر جمع شده
		<lb/>از سر صمیمی صدق و ا اخلاص بنیت صحیحه خالیاً عن الریا  و الاعواض بجهت طالبین و تعمیلی و آداب <flag>مسترسدین</flag> و ترغیب سائر ایشان علی وجه الاعتبار حثا
		<lb/>علی طاعة الله و ابتغاء علی مرضات الله که از جملۀ اعظم اسماء الله هوست و بفارسیه ذکر آرّه میگویند بر سبیل چهره علانیه  بر ان استغال مینمایند کما طریقه
		<lb/>ش

In [204]:
output_file = 'parser_output.xml'
# Plain write mode is enough here; write as UTF-8 to preserve the non-ASCII text.
with open(output_file, 'w', encoding='utf-8') as fout:
    fout.write(final)
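
Because each `----` emits a `</div>` / `<div>` pair and only the first `</div>` is removed afterwards, it may be worth confirming that the tags balance before relying on the written file. A minimal sketch of such a check, using the standard library's `xml.etree.ElementTree`; the helper name `is_well_formed` and both sample strings are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

balanced = '<document serial = "626">\n<div>\n\t<lb/>text\n</div>\n</document>'
unbalanced = '<document serial = "626">\n<div>\n\t<lb/>text\n</document>'

print(is_well_formed(balanced))
print(is_well_formed(unbalanced))
```

This only tests well-formedness; validating against the `.rnc` schema would need a separate RELAX NG tool.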