# Eurydice processing

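Process hack `.md` files:

- split on `--`; the first item is always assumed to be `TEXT`;
- check each later item for a leading `TEXT` marker (items starting `![` are images; anything else is a source record);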
- add article entries to db;
- add images to db;
- rewrite articles to useful MyST format;
- rewrite images to useful MyST format.

In [248]:
import yaml

with open("_toc.yml", "r") as f:
    toc = yaml.safe_load(f)
    
raw_files = [f["file"].strip('_') for p in toc["parts"] for f in p["chapters"] if f["file"].startswith("__")]
raw_files

['launch_of_eurydice',
 'sea_trial_shenanigans',
 'training_ship',
 'last_four_days_eurydice_verney',
 'foundering_of_the_eurydice',
 'immediate_plans',
 'struggle_to_raise_continues',
 'life_aside',
 'raising_the_eurydice',
 'life_aside_2',
 'eurydice_in_port',
 'fundraising_relief',
 'eurydice_bodies_funerals_and_memorials',
 'eurydice_poems',
 'eurydice_reminiscences',
 'inquest',
 'eurydice_court_martial',
 'eurydice_tracked_in_parliament']
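One thing to watch in the list comprehension above: `str.strip('_')` removes underscores from both ends of the name, not just the `__` prefix. A minimal sketch of a prefix-only alternative, assuming Python 3.9+ for `str.removeprefix`; the `toc` dictionary here is a made-up stand-in for `_toc.yml`, and `__trailing_` is a hypothetical file name:

```python
# Hypothetical stand-in for the parsed _toc.yml (the real file is not shown here)
toc = {
    "parts": [
        {"chapters": [{"file": "__training_ship"}, {"file": "intro"}]},
        {"chapters": [{"file": "__trailing_"}]},
    ]
}

raw_files = [
    ch["file"].removeprefix("__")  # drop the leading "__" only
    for part in toc["parts"]
    for ch in part["chapters"]
    if ch["file"].startswith("__")
]
print(raw_files)
```

With `strip('_')` the second name would lose its trailing underscore as well, so the later `open(f"{f}.md")` would look for the wrong file.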

In [6]:
from sqlite_utils import Database

db_name = "eurydice-demo.db"

# Connect to the database (the file is created if it does not already exist)
db = Database(db_name)

In [7]:
# Do not run this cell if your database already exists!

# While developing the script, recreate database each time...
db = Database(db_name, recreate=True)

In [8]:
# This schema has been evolved iteratively as I have identified structure
# that can be usefully mined...

db["sources"].create({
    "url": str,
    "fn": str,
    "publication": str,
    "published_date": str, # this may range from year to actual date
    "title": str, # Title of section
    "date": str, # optional; the second date field; may be eg correspondence date
    "author": str, # attempt at provenance
    "pages": str, # page number(s), or a page-like locator
    "text": str,
},# pk=("url", "title") # Need an autoincrement; no natural key?
)

# Enable full text search
# This creates an extra virtual table (sources_fts) to support the full text search
db["sources"].enable_fts(["publication","title", "text", "published_date"], create_triggers=True)

<Table sources (url, fn, publication, published_date, title, date, author, pages, text)>
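On the open question of a key: a SQLite table created without an explicit primary key still has an implicit integer `rowid`, which can serve as the autoincrementing identifier. The sketch below uses the standard-library `sqlite3` module rather than `sqlite_utils`, with an in-memory database and a cut-down, illustrative version of the schema. The second record, and the way missing fields are filled with `None`, are assumptions for demonstration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sources (url TEXT, publication TEXT, published_date TEXT, "
    "title TEXT, pages TEXT, text TEXT)"
)

columns = ["url", "publication", "published_date", "title", "pages", "text"]
records = [
    {
        "url": "https://www.britishnewspaperarchive.co.uk/viewer/bl/0000069/18780403/007/0002",
        "publication": "Hampshire Telegraph",
        "published_date": "1878-04-03",
        "title": "STATEMENT IN PARLIAMENT",
        "pages": "2",
        "text": "(article text)",
    },
    # Hypothetical record with missing metadata, to show the None fill
    {"title": "EXAMPLE WITHOUT METADATA", "text": "Placeholder text."},
]

placeholders = ", ".join(f":{c}" for c in columns)
conn.executemany(
    f"INSERT INTO sources ({', '.join(columns)}) VALUES ({placeholders})",
    # Fill any missing column with None so every row binds cleanly
    [{c: r.get(c) for c in columns} for r in records],
)

rows = conn.execute("SELECT rowid, title, url FROM sources ORDER BY rowid").fetchall()
for row in rows:
    print(row)
```

Each inserted row gets the next `rowid` automatically, so records with no URL or a repeated title can still be told apart.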

In [9]:
fn = "training_ship.md"

def get_file_contents(fn):
    """Open file from filename and get file contents."""
    with open(fn) as f:
        txt = f.read().strip()
    return txt

txt = get_file_contents(fn)
txt[:100]

'# H.M.S. Eurydice, Training Ship\n\nIn 1877, H.M.S. Eurydice was refitted as a training ship. For many'

In [11]:
def get_sections(txt):
    """Get sections from file."""
    txt_sections = [s.strip('-').strip() for s in txt.split("--") if s.strip('-').strip()]
    return txt_sections

txt_sections = get_sections(txt)
txt_sections

['# H.M.S. Eurydice, Training Ship\n\nIn 1877, H.M.S. Eurydice was refitted as a training ship. For many years, the fleet had been miving over to aromoured steamships and rigged sailing ships were no longer appropriate for modern naval warfare. However, it was still felt that sailors should still be able to handle traditionally rigged vessels and that training ratings on them would be to their benefit as seamen.\n\nThe Eurydice was thus converted over to a training ship in the Spring of 1877, before embarking on a voyage to the West Indies in Autumn 1877. Rumour had it that a calamity had befallen the ship, but a telegram reported in the *Naval & Military Gazette and Weekly Chronicle of the United Service* of Wednesday 21 November 1877, received 17th of November and dated November 16th, 1877, revealed that all was well.\n\nIn passing, we might note that this news articles reveals how tegraphic cable communication was available between the West Indies and Britain by this time. A little 
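`txt.split("--")` splits on every double hyphen, including any that appear inside running text. If that becomes a problem, one option is to split only on lines made up entirely of hyphens. A minimal sketch, assuming separators always sit on a line of their own; `get_sections_strict` and the sample text are hypothetical:

```python
import re

def get_sections_strict(txt):
    """Split only on lines that consist solely of two or more hyphens."""
    parts = re.split(r"^[ \t]*-{2,}[ \t]*$", txt, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

sample = (
    "# Title\n\nIntro text -- with an aside.\n\n"
    "---\n\n"
    "https://example.com/article\nSome Paper\n"
)
sections = get_sections_strict(sample)
print(sections)
```

The inline `--` in the first paragraph is left alone here, whereas a plain `split("--")` would cut that paragraph in two.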

In [81]:
# First section is text, but then we need to parse type

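def structure_record(txt_sections):
    """Tag each section as TEXT, IMAGE or RESOURCE."""
    # The first section is always treated as text
    typ_s = [("TEXT", txt_sections[0])]

    for s in txt_sections[1:]:
        s = s.strip()
        if s.startswith("TEXT"):
            # Drop only the leading marker, not every "TEXT" in the body
            typ_s.append(("TEXT", s[len("TEXT"):].strip()))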
        elif s.startswith("!["):
            typ_s.append(("IMAGE", s))
        else:
            # Should we assume text unless we get eg an http at the start of a record?
            typ_s.append(("RESOURCE", s))

    return typ_s

typ_s = structure_record(txt_sections)
typ_s

[('TEXT',
  '# H.M.S. Eurydice, Training Ship\n\nIn 1877, H.M.S. Eurydice was refitted as a training ship. For many years, the fleet had been miving over to aromoured steamships and rigged sailing ships were no longer appropriate for modern naval warfare. However, it was still felt that sailors should still be able to handle traditionally rigged vessels and that training ratings on them would be to their benefit as seamen.\n\nThe Eurydice was thus converted over to a training ship in the Spring of 1877, before embarking on a voyage to the West Indies in Autumn 1877. Rumour had it that a calamity had befallen the ship, but a telegram reported in the *Naval & Military Gazette and Weekly Chronicle of the United Service* of Wednesday 21 November 1877, received 17th of November and dated November 16th, 1877, revealed that all was well.\n\nIn passing, we might note that this news articles reveals how tegraphic cable communication was available between the West Indies and Britain by this time
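The comment in `structure_record` asks whether untyped sections should default to text unless they open with a URL. A sketch of that alternative follows; `classify_section` is a hypothetical helper, and treating a leading `http` as the only sign of a source record is an assumption rather than the rule used above:

```python
def classify_section(s):
    """Return a (type, content) pair for one section."""
    s = s.strip()
    if s.startswith("TEXT"):
        return ("TEXT", s[len("TEXT"):].strip())
    if s.startswith("!["):
        return ("IMAGE", s)
    if s.startswith("http"):
        return ("RESOURCE", s)
    # Anything else is assumed to be plain text
    return ("TEXT", s)

examples = [
    "TEXT\nSome linking commentary.",
    "![](../images/example.png)\nA caption",
    "https://example.com/article\nSome Paper\n1878-04-03\np. 2\nTITLE\nBody",
    "An untagged paragraph.",
]
types = [classify_section(e)[0] for e in examples]
print(types)
```

The trade-off is that a source record without a URL on its first line would be passed through as text rather than parsed.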

In [13]:
#%pip install dateparser
# https://dateparser.readthedocs.io/en/latest/usage.html
from dateparser.date import DateDataParser

ddp = DateDataParser(languages=['en'])

In [220]:
import re

dt = "%Y-%m-%d"

def parse_sections(txt_sections, fn=None):
    """Parse file section."""
    records = []
    for section in txt_sections:
        txt_lines = [l.strip() for l in section.split('\n') if l.strip()]
        #print(txt_lines)
        record = {"fn":fn}
        for i, line in enumerate(txt_lines):
            line = line.strip()
            # This is inefficient...
            # We should test as fallback...
            try_url = line.startswith("http")
            try_date = ddp.get_date_data(line.replace('Publication date', '').strip())
            try_pages = re.search(r"^pp?\.?\s?([0-9ivxlc\?].*)", line)

            if try_url:
                record["url"] = line
            elif try_date["date_obj"]:
                if "published_date" in record:
                    record["date"] = try_date.date_obj.strftime(dt)
                else:
                    record["published_date"] = try_date.date_obj.strftime(dt)
            elif try_pages:
                record["pages"] = try_pages.group(1)
            elif "publication" not in record:
                record["publication"] = line

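            # We take pages as the last item of metadata...
            if try_pages:
                break

        if not txt_lines or not try_pages:
            # Without a pages line we cannot tell where the metadata ends
            raise ValueError(f"No pages line found in section of {fn!r}")

        # Everything after the first occurrence of the pages line is body text
        txt = section.split(try_pages.group(0), 1)[1].strip()
        first_line = txt.split("\n")[0]
        if len(first_line) > 200:
            record["title"] = txt[:100]
            record["text"] = txt[100:]
        else:
            record["title"] = first_line
            # Remove the title from the start only, not from the body
            record["text"] = txt[len(first_line):].strip()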
        #if len(txt_lines[i+1])>200:
        #    record["title"] = txt_lines[i+1][:100]
        #    record["text"] = "\n\n".join(txt_lines[i+1:])[100:]
        #else:
        #    record["title"] = txt_lines[i+1]
        #    record["text"] = "\n\n".join(txt_lines[i+2:])
        records.append(record)

    return records
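
As the comments in `parse_sections` note, running the date parser on every line is wasteful. One way to treat it as a fallback is to try the cheap URL and page tests first and only then attempt a date parse. The sketch below is standard-library only: `classify_line` and `parse_date` are hypothetical names, and `parse_date` is a simple `strptime` stand-in for `ddp.get_date_data`, limited to two assumed formats.

```python
import re
from datetime import datetime

PAGES_RE = re.compile(r"^pp?\.?\s?([0-9ivxlc\?].*)")

def parse_date(line):
    """Stand-in date parser: try a couple of fixed formats, else None."""
    for fmt in ("%Y-%m-%d", "%d %B %Y"):
        try:
            return datetime.strptime(line, fmt)
        except ValueError:
            pass
    return None

def classify_line(line):
    """Return (kind, value), running the date parser only as a fallback."""
    line = line.strip()
    if line.startswith("http"):
        return ("url", line)
    pages = PAGES_RE.search(line)
    if pages:
        return ("pages", pages.group(1))
    date = parse_date(line.replace("Publication date", "").strip())
    if date:
        return ("date", date.strftime("%Y-%m-%d"))
    return ("other", line)

for line in ["https://example.com/a", "Hampshire Telegraph", "3 April 1878", "p. 2"]:
    print(classify_line(line))
```

Note that this changes the order of the tests: the page pattern is permissive (a lowercase `p` followed by a digit, `?`, or one of `ivxlc`), so any line of that shape is now taken as a page reference before a date parse is attempted.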

In [221]:
parse_sections([typ_s[1][1]])

[{'fn': None,
  'url': 'https://www.britishnewspaperarchive.co.uk/viewer/bl/0000069/18780403/007/0002',
  'publication': 'Hampshire Telegraph',
  'published_date': '1878-04-03',
  'pages': '2',
  'title': 'STATEMENT IN PARLIAMENT',
  'text': "In the Houseof Commons, on Monday, Captain Price asked the First Lord of the Admiralty what was the amount of ballast in Her Majesty's ship Eurydice when she left England, and was there any reason to suppose any was removed abroad; what were her angles of maximum and vanishing stability, ascertained from the experiments said to have been made on her, and were these angles communicated to Captain Hare; were the Eurydice's hammocks made buoyant by any method recommended to the the Admiralty, or were there life-belts sufficient for the officers and men; and what is the objection to the hammocks being made buoyant either by means of cork mattresses or waterproof sheets, seeing that they are so stowed as to be immediately accessible in cases of sudden 

In [337]:
from datetime import datetime
import humanize

def admonition_generator(record):
    """Generate MyST admonition markdown for the record."""
    dt_ = datetime.fromisoformat(record["published_date"])
    # The humanize package gives us things like 3rd, 27th, etc.
    daynum = humanize.ordinal(dt_.day)
    # Format the date to something like: Wednesday, April 3rd, 1878
    # %A is the day of the week (Monday, Tuesday, etc.)
    # %B is the month (March, April, etc.)
    # %Y is the 4-digit year (eg 1878)
    dt = dt_.strftime(f'%A, %B {daynum}, %Y')
    admonition = f"""
```{{admonition}} {record["title"]} - {dt}
:class: note dropdown

[{record["publication"]}]({record["url"]}), {record["published_date"]}, p. {record["pages"]}

{record["text"]}

```
"""
    return admonition
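
If the `humanize` package is not available, the ordinal day can be produced with a few lines of standard-library Python. This is a sketch only: `ordinal` here is a hypothetical local helper, not the `humanize` function, and it handles English suffixes for positive integers.

```python
from datetime import datetime

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 3 -> '3rd'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

dt_ = datetime.fromisoformat("1878-04-03")
formatted = dt_.strftime(f"%A, %B {ordinal(dt_.day)}, %Y")
print(formatted)
```

The day and month names come from `strftime`, so they follow the active locale.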

In [335]:
print(admonition_generator(parse_sections([typ_s[1][1]])[0]))


```{admonition} Wednesday, April 3rd, 1878 - STATEMENT IN PARLIAMENT 
:class: note dropdown

[Hampshire Telegraph](https://www.britishnewspaperarchive.co.uk/viewer/bl/0000069/18780403/007/0002), 1878-04-03, p. 2

In the Houseof Commons, on Monday, Captain Price asked the First Lord of the Admiralty what was the amount of ballast in Her Majesty's ship Eurydice when she left England, and was there any reason to suppose any was removed abroad; what were her angles of maximum and vanishing stability, ascertained from the experiments said to have been made on her, and were these angles communicated to Captain Hare; were the Eurydice's hammocks made buoyant by any method recommended to the the Admiralty, or were there life-belts sufficient for the officers and men; and what is the objection to the hammocks being made buoyant either by means of cork mattresses or waterproof sheets, seeing that they are so stowed as to be immediately accessible in cases of sudden emergency ? 

Mr. W. H. SMITH

In [276]:
# Parse image
# if we have an image, we need to pattern match until we get to a \n\n
# then create a figure and replace the original image matched pattern
xx="""

ssdsdd

![](../images/ILN_1878_loss_of_euridyce_apr_06_03x.png)
Illustrated London News — H.M.S. Eurydice as she lay at eight a.m. on March 25 off Dunnose Point, Isle of Wight, April 6, 1878

asaa

![](../images/ILN_1878_loss_of_euridyce_apr_06_03x.png)
Illustrated London News — H.M.S. Eurydice as she lay at eight a.m. on March 25 off Dunnose Point, Isle of Wight, April 6, 1878

asaa
"""

# The following says: .*? lazy search, (?=\n\n) lookahead to next \n\n
# re.MULTILINE | re.DOTALL give us the search over multiple lines
p = re.findall(r'!\[.*?(?=\n\n)', xx, re.MULTILINE | re.DOTALL)
p

['![](../images/ILN_1878_loss_of_euridyce_apr_06_03x.png)\nIllustrated London News — H.M.S. Eurydice as she lay at eight a.m. on March 25 off Dunnose Point, Isle of Wight, April 6, 1878',
 '![](../images/ILN_1878_loss_of_euridyce_apr_06_03x.png)\nIllustrated London News — H.M.S. Eurydice as she lay at eight a.m. on March 25 off Dunnose Point, Isle of Wight, April 6, 1878']

In [286]:
re.findall(r"!\[[^\]]*\]\(([^\)]*)\)(.*)$", p[0], re.MULTILINE | re.DOTALL)[0][1]

'\nIllustrated London News — H.M.S. Eurydice as she lay at eight a.m. on March 25 off Dunnose Point, Isle of Wight, April 6, 1878'

In [302]:
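def generate_figure(doc):
    """Rewrite markdown images (plus caption text) as MyST figure blocks."""
    # Each match runs from "![" up to (not including) the next blank line
    images = re.findall(r'!\[.*?(?=\n\n)', doc, re.MULTILINE | re.DOTALL)
    for image in images:
        # Capture the image path and any caption text following the image
        path = re.findall(r"!\[[^\]]*\]\(([^\)]*)\)(.*)$", image, re.MULTILINE | re.DOTALL)
        if not path:
            continue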
        txt = f"""
```{{figure}} {path[0][0]}
---
---
{path[0][1]}
```

"""
        doc = doc.replace(image, txt)
    return doc
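
Because the image pattern relies on a lookahead for a blank line, an image and caption at the very end of a document (with no blank line after it) is not matched. A possible variant does the rewrite in one pass with `re.sub` and also accepts end of text as a terminator. This is a sketch under those assumptions; `FIGURE_RE`, `to_figure` and `generate_figure_sub` are hypothetical names, and the fence is built with string multiplication only to keep the example readable.

```python
import re

FENCE = "`" * 3  # three backticks, as used to open and close a MyST directive

# Image markup, then any caption text up to the next blank line or end of text
FIGURE_RE = re.compile(r"!\[[^\]]*\]\(([^)]*)\)(.*?)(?=\n\n|\Z)", re.DOTALL)

def to_figure(match):
    """Build a MyST figure block from one image match."""
    path, caption = match.group(1), match.group(2).strip()
    return f"{FENCE}{{figure}} {path}\n---\n---\n{caption}\n{FENCE}"

def generate_figure_sub(doc):
    """Replace every image (plus caption) in doc with a figure block."""
    return FIGURE_RE.sub(to_figure, doc)

doc = "Intro\n\n![](../images/example.png)\nA caption for the image"
result = generate_figure_sub(doc)
print(result)
```

Using a substitution callback also avoids the `doc.replace(image, txt)` step, which rewrites every identical occurrence of a matched string at once.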

In [303]:
# Parse types

def create_admontions(typ_s):
    parsed = []
    for s in typ_s:
        if s[0] in ("TEXT", "IMAGE"):
            # Text and image sections pass through unchanged;
            # images are rewritten later by generate_figure()
            parsed.append( s )
        elif s[0] == "RESOURCE":
            # Put the resource into an admonition block
            _parsed = admonition_generator(parse_sections([s[1]])[0])

            parsed.append( (s[0], _parsed) )
        else:
            # This should be null
            pass

    return parsed

parsed = create_admontions(typ_s)
parsed

[('TEXT', '# Parliamentary Mentions'),
 ('RESOURCE',
  "\n```{admonition} STATEMENT IN PARLIAMENT\n:class: note dropdown\n\n[Hampshire Telegraph](https://www.britishnewspaperarchive.co.uk/viewer/bl/0000069/18780403/007/0002), 1878-04-03, p. 2\n\nIn the Houseof Commons, on Monday, Captain Price asked the First Lord of the Admiralty what was the amount of ballast in Her Majesty's ship Eurydice when she left England, and was there any reason to suppose any was removed abroad; what were her angles of maximum and vanishing stability, ascertained from the experiments said to have been made on her, and were these angles communicated to Captain Hare; were the Eurydice's hammocks made buoyant by any method recommended to the the Admiralty, or were there life-belts sufficient for the officers and men; and what is the objection to the hammocks being made buoyant either by means of cork mattresses or waterproof sheets, seeing that they are so stowed as to be immediately accessible in cases of sudd

In [304]:
myst_txt = "\n\n".join([t[1] for t in parsed ])

with open("test.md", "w") as f:
    f.write(myst_txt)

In [338]:
for f in raw_files:
    txt = get_file_contents(f"{f}.md")
    txt_sections = get_sections(txt)
    typ_s = structure_record(txt_sections)
    parsed = create_admontions(typ_s)
    # Use a separate name for the file handle so it does not shadow the loop variable f
    with open(f"__{f}.md", "w") as out_f:
        myst_txt = "\n\n".join([t[1] for t in parsed ])
        myst_txt = generate_figure(myst_txt)
        out_f.write(myst_txt)

In [313]:
f

'life_aside_2'
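
Inspecting `f` after the loop shows the most recent value of the loop variable, which helps locate a file that stopped the run. For longer runs it may be easier to catch errors per file and report them at the end. A minimal sketch, assuming a hypothetical `process_all` wrapper; the `fake_process` function below only simulates a failure and stands in for the real read, parse and write steps:

```python
def process_all(names, process_one):
    """Run process_one on each name, collecting failures instead of stopping."""
    failures = {}
    for name in names:
        try:
            process_one(name)
        except Exception as exc:  # broad on purpose: we want a full report
            failures[name] = repr(exc)
    return failures

def fake_process(name):
    # Stand-in for the real per-file pipeline; fails for one made-up name
    if name == "broken_example":
        raise ValueError("No pages line found")

failures = process_all(["training_ship", "broken_example", "inquest"], fake_process)
print(failures)
```

Files that parse cleanly are still written, and the returned dictionary lists the ones that need attention.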