# Eurydice processing

Process hack `.md` files:
    
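- split on `--`; the first item is always assumed to be `TEXT`;
- check each later item for a leading `TEXT` marker; items starting with `![` are images and anything else is a resource;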
- add article entries to db;
- add images to db;
- rewrite articles to useful MyST format;
- rewrite images to useful MyST format.

In [72]:
import yaml

with open("_toc.yml", "r") as f:
    toc = yaml.safe_load(f)
    
raw_files = [f["file"].strip('_') for p in toc["parts"] for f in p["chapters"] if f["file"].startswith("__")]
raw_files

['launch_of_eurydice',
 'sea_trial_shenanigans',
 'training_ship',
 'last_four_days_eurydice_verney',
 'foundering_of_the_eurydice',
 'inquest',
 'immediate_plans',
 'struggle_to_raise_continues',
 'life_aside',
 'raising_the_eurydice',
 'life_aside_2',
 'eurydice_court_martial',
 'eurydice_in_port',
 'fundraising_relief',
 'eurydice_bodies_funerals_and_memorials',
 'eurydice_poems',
 'eurydice_reminiscences',
 'eurydice_tracked_in_parliament',
 'brading_haven_reclamation',
 'bembridge_improvement']

In [2]:
from sqlite_utils import Database

db_name = "eurydice-demo.db"

# Connect to a pre-existing database (the file is created if it does not exist)
db = Database(db_name)

In [3]:
# Do not run this cell if your database already exists!

# While developing the script, recreate database each time...
db = Database(db_name, recreate=True)

In [4]:
# This schema has been evolved iteratively as I have identified structure
# that can be usefully mined...

db["sources"].create({
    "url": str,
    "fn": str,
    "publication": str,
    "published_date": str, # this may range from year to actual date
    "title": str, # Title of section
    "date": str, # optional; the second date field; may be eg correspondence date
    "author": str, # attempt at provenance
    "pages": str, # or pages like
    "text": str,
},# pk=("url", "title") # Need an autoincrement; no natural key?
)

# Enable full text search
# This creates an extra virtual table (sources_fts) to support the full text search
db["sources"].enable_fts(["publication","title", "text", "published_date"], create_triggers=True)

<Table sources (url, fn, publication, published_date, title, date, author, pages, text)>

In [5]:
fn = "training_ship.md"

def get_file_contents(fn):
    """Open file from filename and get file contents."""
    with open(fn) as f:
        txt = f.read().strip()
    return txt

txt = get_file_contents(fn)
txt[:100]

'# H.M.S. Eurydice, Training Ship\n\nIn 1877, H.M.S. Eurydice was refitted as a training ship. For many'

In [6]:
def get_sections(txt):
    """Get sections from file."""
    txt_sections = [s.strip('-').strip() for s in txt.split("--") if s.strip('-').strip()]
    return txt_sections

txt_sections = get_sections(txt)
txt_sections

['# H.M.S. Eurydice, Training Ship\n\nIn 1877, H.M.S. Eurydice was refitted as a training ship. For many years, the fleet had been moving over to armoured steamships and rigged sailing ships were no longer appropriate for modern naval warfare. However, it was still felt that sailors should still be able to handle traditionally rigged vessels and that training ratings on them would be to their benefit as seamen.\n\n![](../images/ILN_1878_loss_of_euridyce_apr_06_011.jpg)\nIllustrated London News — H.M.S. Eurydice as she lay in Portsmouth harbour before her last voyage - from a photograph, April 6, 1878\n\n## On the Training of Sailors\n\nThe Eurydice was thus converted over to a training ship in the Spring of 1877, before embarking on a voyage to the West Indies in Autumn 1877.',
 "https://www.britishnewspaperarchive.co.uk/viewer/bl/0000919/18770206/042/0005\n\nSouth Wales Daily News\nTuesday 06 February 1877\np5\n\nSEAMANSHIP OF YOUNG SAILORS.\n\nThe Admiralty are about to take practica

In [7]:
# First section is text, but then we need to parse type

def structure_record(txt_sections):
    typ_s = [("TEXT", txt_sections[0])]

    for s in txt_sections[1:]:
        s = s.strip()
        if s.startswith("TEXT"):
            typ_s.append(("TEXT", s.replace("TEXT","").strip()))
        elif s.startswith("!["):
            typ_s.append(("IMAGE", s))
        else:
            # Should we assume text unless we get eg an http at the start of a record?
            typ_s.append(("RESOURCE", s))

    return typ_s

typ_s = structure_record(txt_sections)
typ_s

[('TEXT',
  '# H.M.S. Eurydice, Training Ship\n\nIn 1877, H.M.S. Eurydice was refitted as a training ship. For many years, the fleet had been moving over to armoured steamships and rigged sailing ships were no longer appropriate for modern naval warfare. However, it was still felt that sailors should still be able to handle traditionally rigged vessels and that training ratings on them would be to their benefit as seamen.\n\n![](../images/ILN_1878_loss_of_euridyce_apr_06_011.jpg)\nIllustrated London News — H.M.S. Eurydice as she lay in Portsmouth harbour before her last voyage - from a photograph, April 6, 1878\n\n## On the Training of Sailors\n\nThe Eurydice was thus converted over to a training ship in the Spring of 1877, before embarking on a voyage to the West Indies in Autumn 1877.'),
 ('RESOURCE',
  "https://www.britishnewspaperarchive.co.uk/viewer/bl/0000919/18770206/042/0005\n\nSouth Wales Daily News\nTuesday 06 February 1877\np5\n\nSEAMANSHIP OF YOUNG SAILORS.\n\nThe Admiralty

In [8]:
#%pip install dateparser
# https://dateparser.readthedocs.io/en/latest/usage.html
from dateparser.date import DateDataParser

ddp = DateDataParser(languages=['en'])

In [9]:
import re

dt = "%Y-%m-%d"

def parse_sections(txt_sections, fn=None):
    """Parse file section."""
    records = []
    for section in txt_sections:
        txt_lines = [l.strip() for l in section.split('\n') if l.strip()]
        #print(txt_lines)
        record = {"fn":fn}
        for i, line in enumerate(txt_lines):
            line = line.strip()
            # This is inefficient...
            # We should test as fallback...
            try_url = line.startswith("http")
            try_date = ddp.get_date_data(line.replace('Publication date', '').strip())
            try_pages = re.search(r"^pp?\.?\s?([0-9ivxlcm\?].*)", line)

            if try_url:
                record["url"] = line
            elif try_date["date_obj"]:
                if "published_date" in record:
                    record["date"] = try_date.date_obj.strftime(dt)
                else:
                    record["published_date"] = try_date.date_obj.strftime(dt)
            elif try_pages:
                record["pages"] = try_pages.group(1)
            elif "publication" not in record:
                record["publication"] = line

            # We take pages as the last item of metadata...
            if try_pages:
                break
        
        txt = f'{record["pages"]}'.join(section.split(try_pages.group(0))[1:]).strip()
        if len(txt.split("\n")[0]) > 200:
            record["title"] = txt[:100]
            record["text"] = txt[100:]
        else:
            record["title"] = txt.split("\n")[0]
            record["text"] = txt.replace(record["title"], "").strip()
        #if len(txt_lines[i+1])>200:
        #    record["title"] = txt_lines[i+1][:100]
        #    record["text"] = "\n\n".join(txt_lines[i+1:])[100:]
        #else:
        #    record["title"] = txt_lines[i+1]
        #    record["text"] = "\n\n".join(txt_lines[i+2:])
        records.append(record)

    return records
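
The page pattern in `parse_sections` accepts any line that starts with `p` followed by a digit or one of the letters `ivxlcm`, so a lowercase line such as `place of birth` would also be read as a page marker. A stricter variant might look like the sketch below. It assumes a page marker fills the whole line and is only ever digits, lowercase roman numerals, `?`, or a simple range or list of these, which may not hold for every source. `STRICT_PAGES_RE` and `match_pages` are hypothetical names, not part of the notebook.

```python
import re

# Assumption: a page marker is the whole line, e.g. "p5", "pp. 12-14", "p. iv" or "p?"
STRICT_PAGES_RE = re.compile(
    r"^pp?\.?\s*("
    r"(?:\d+|[ivxlcm]+|\?)"               # first page: digits, roman numerals or "?"
    r"(?:\s*[-,]\s*(?:\d+|[ivxlcm]+))*"   # optional further pages or a range
    r")\s*$"
)

def match_pages(line):
    """Return the page string if the line looks like a page marker, else None."""
    m = STRICT_PAGES_RE.match(line.strip())
    return m.group(1) if m else None

for candidate in ["p5", "pp. 12-14", "p. iv", "place of birth"]:
    print(repr(candidate), "->", match_pages(candidate))
```

This is still only a heuristic: a one-word line made up of `p` plus roman-numeral letters (for example `pill`) would still be accepted.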

In [10]:
parse_sections([typ_s[1][1]])

[{'fn': None,
  'url': 'https://www.britishnewspaperarchive.co.uk/viewer/bl/0000919/18770206/042/0005',
  'publication': 'South Wales Daily News',
  'published_date': '1877-02-06',
  'pages': '5',
  'title': 'SEAMANSHIP OF YOUNG SAILORS.',
  'text': "The Admiralty are about to take practical measures for improving the seamanship of our young sailors. At present a boy having served a certain time on board a training-ship is transferred to a flagship, where he becomes an ordinary seaman. He is then draughted to a sea-going ship, and may, under favourable conditions, become an expert and efficient seaman, knowing the name and use of every rope on board, and capable of turning his hands to anything that may be required in the severest weather. It may happen, however, that he is sent to a ram of Rupert type, or a mastless ship like the Devastation where he can learn little or nothing of his profession; and as vessels of these classes are increasing, and likely to increase, it is necessary t

In [11]:
from datetime import datetime
import humanize

def admonition_generator(record):
    """Generate MyST admonition markdown for the record."""
    dt_ = datetime.fromisoformat(record["published_date"])
    # The humanize package gives us things like 3rd, 27th, etc.
    daynum = humanize.ordinal(dt_.day)
    # Format the date to something like: Wednesday, April 3rd, 1878
    # %A is the day of the week (Monday, Tuesday, etc.)
    # %B is the month (March, April, etc.)
    # %Y is the 4-digit year (eg 1878)
    dt = dt_.strftime(f'%A, %B {daynum}, %Y')
    admonition = f"""
```{{admonition}} {record["title"]} - {dt}
:class: note dropdown

[{record["publication"]}]({record["url"]}), {record["published_date"]}, p. {record["pages"]}

{record["text"]}

```
"""
    return admonition
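
If the `humanize` dependency is not available, the ordinal day suffix can be produced with the standard library alone. The following is a minimal sketch of that alternative, not what the notebook uses; `ordinal` is a hypothetical helper name.

```python
from datetime import datetime

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 3 -> '3rd', 27 -> '27th'."""
    # 11, 12 and 13 (and 111, 112, ...) take "th" rather than "st"/"nd"/"rd"
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Same date format as admonition_generator, using the stdlib helper
dt_ = datetime.fromisoformat("1877-02-06")
print(dt_.strftime(f"%A, %B {ordinal(dt_.day)}, %Y"))
```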

In [12]:
print(admonition_generator(parse_sections([typ_s[1][1]])[0]))


```{admonition} SEAMANSHIP OF YOUNG SAILORS. - Tuesday, February 6th, 1877
:class: note dropdown

[South Wales Daily News](https://www.britishnewspaperarchive.co.uk/viewer/bl/0000919/18770206/042/0005), 1877-02-06, p. 5

The Admiralty are about to take practical measures for improving the seamanship of our young sailors. At present a boy having served a certain time on board a training-ship is transferred to a flagship, where he becomes an ordinary seaman. He is then draughted to a sea-going ship, and may, under favourable conditions, become an expert and efficient seaman, knowing the name and use of every rope on board, and capable of turning his hands to anything that may be required in the severest weather. It may happen, however, that he is sent to a ram of Rupert type, or a mastless ship like the Devastation where he can learn little or nothing of his profession; and as vessels of these classes are increasing, and likely to increase, it is necessary that special measures should b

In [13]:
# Parse image
# if we have an image, we need to pattern match until we get to a \n\n
# then create a figure and replace the original image matched pattern
xx="""

ssdsdd

![](../images/ILN_1878_loss_of_euridyce_apr_06_03x.png)
Illustrated London News — H.M.S. Eurydice as she lay at eight a.m. on March 25 off Dunnose Point, Isle of Wight, April 6, 1878

asaa

![](../images/ILN_1878_loss_of_euridyce_apr_06_03x.png)
Illustrated London News — H.M.S. Eurydice as she lay at eight a.m. on March 25 off Dunnose Point, Isle of Wight, April 6, 1878

asaa
"""

# The following says: .*? lazy search, (?=\n\n) lookahead to next \n\n
# re.MULTILINE | re.DOTALL give us the search over multiple lines
p = re.findall(r'!\[.*?(?=\n\n)', xx, re.MULTILINE | re.DOTALL)
p

['![](../images/ILN_1878_loss_of_euridyce_apr_06_03x.png)\nIllustrated London News — H.M.S. Eurydice as she lay at eight a.m. on March 25 off Dunnose Point, Isle of Wight, April 6, 1878',
 '![](../images/ILN_1878_loss_of_euridyce_apr_06_03x.png)\nIllustrated London News — H.M.S. Eurydice as she lay at eight a.m. on March 25 off Dunnose Point, Isle of Wight, April 6, 1878']

In [14]:
re.findall(r"!\[[^\]]*\]\(([^\)]*)\)(.*)$", p[0], re.MULTILINE | re.DOTALL)[0][1]

'\nIllustrated London News — H.M.S. Eurydice as she lay at eight a.m. on March 25 off Dunnose Point, Isle of Wight, April 6, 1878'

In [15]:
def generate_figure(doc):
    
    images = re.findall(r'!\[.*?(?=\n\n)', doc, re.MULTILINE | re.DOTALL)
    for image in images:
        # Raw string avoids invalid-escape warnings; group 1 is the image path, group 2 the caption
        path = re.findall(r"!\[[^\]]*\]\(([^\)]*)\)(.*)$", image, re.MULTILINE | re.DOTALL)
        if not path:
            continue
        txt = f"""
```{{figure}} {path[0][0]}
---
---
{path[0][1]}
```

"""
        doc = doc.replace(image, txt)
    return doc
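
The block pattern in `generate_figure` only matches an image that is followed by a blank line, so an image and caption at the very end of a document would be left unconverted. A possible variant, sketched below, also treats the end of the string as the end of a block. The sample text and the file names `a.png` and `b.png` are made up for illustration, and `BLOCK_OR_EOF_RE` is a hypothetical name.

```python
import re

# Pattern as used in generate_figure: a block must be followed by a blank line
BLOCK_RE = re.compile(r"!\[.*?(?=\n\n)", re.DOTALL)
# Assumed variant: also accept the end of the document (\Z) as the end of a block
BLOCK_OR_EOF_RE = re.compile(r"!\[.*?(?=\n\n|\Z)", re.DOTALL)

# Hypothetical document whose last image block has no trailing blank line
doc = "intro\n\n![](a.png)\ncaption one\n\nmiddle\n\n![](b.png)\ncaption two"

original_blocks = BLOCK_RE.findall(doc)
variant_blocks = BLOCK_OR_EOF_RE.findall(doc)

print(len(original_blocks), "block(s) found with the original pattern")
print(len(variant_blocks), "block(s) found with the end-of-document variant")
```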

In [80]:
# Parse types

def create_admontions(typ_s):
    parsed = []
    for s in typ_s:
        if s[0] in ("TEXT", "IMAGE"):
            # Pass text and image sections through unchanged;
            # images are rewritten later by generate_figure()
            parsed.append( s )
        elif s[0] == "RESOURCE":
            # parse resource
            # Put things into an admonition block
            #print("\n\n\*****\n"+s[1])
            _parsed = admonition_generator(parse_sections([s[1]])[0])

            parsed.append( (s[0], _parsed) )
        else:
            # This should be null
            pass

    return parsed

In [79]:
parsed = create_admontions(typ_s)
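# generate_figure() expects a string, so apply it to the text of each (type, text) pair
parsed = [(typ, generate_figure(body)) for typ, body in parsed]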
parsed



\*****
https://www.britishnewspaperarchive.co.uk/viewer/bl/0000170/18790628/007/0005
Isle of Wight Observer
Saturday 28 June 1879

p5

The Reclamation of Brading Harbour.

Just as we are going to press, we learn from the contractor, Mr. Frederick Seymour, that the work of reclaiming Brading Harbour, which has been carried on for the last two or three years, has at last terminated successfully, and an immense tract of what we hope will be ere long arable land added to the island. Ihe contractors have had many difficulties to contend with, but it was only to be expected that with all the modern appliances at his command, Mr. Seymour would be as successful as his great predecessor in the same work two centuries ago— Sir Hugh Middleton.


\*****
https://www.britishnewspaperarchive.co.uk/viewer/bl/0000495/18790709/028/0004
Hampshire Advertiser
Wednesday 09 July 1879
p4

BRADING, July 9.

Brading Haven Reclamation Scheme.— During the few years the works in connection with the above scheme 

KeyError: 'pages'
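
The `KeyError: 'pages'` is most likely raised in `parse_sections`, which reads `record["pages"]` unconditionally after the loop. That key is only set when a line reaches the page branch, so a section with no page line (or one whose page line was claimed by the URL or date branch) fails. The sketch below lists sections in which no line matches the page pattern, which may help locate the offending source file. It only covers the first of the two cases. `sections_missing_pages` is a hypothetical helper and the sample sections use made-up `example.org` URLs.

```python
import re

# Same page pattern as in parse_sections
PAGES_RE = re.compile(r"^pp?\.?\s?([0-9ivxlcm\?].*)")

def sections_missing_pages(sections):
    """Return (index, first_line) for each section with no page-marker line."""
    missing = []
    for i, section in enumerate(sections):
        lines = [l.strip() for l in section.split("\n") if l.strip()]
        if not any(PAGES_RE.search(l) for l in lines):
            missing.append((i, lines[0] if lines else ""))
    return missing

# Hypothetical sections: the second one has no page line
sample_sections = [
    "https://example.org/a\nSome Paper\nTuesday 06 February 1877\np5\n\nTITLE\n\nBody",
    "https://example.org/b\nSome Paper\nNo marker here\n\nBody text",
]
print(sections_missing_pages(sample_sections))
```

In the notebook this could be run over the `RESOURCE` entries of `typ_s` for each file before calling `create_admontions`.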

In [18]:
myst_txt = "\n\n".join([t[1] for t in parsed ])

with open("test.md", "w") as f:
    f.write(myst_txt)

In [85]:
for f in raw_files:
    txt = get_file_contents(f"{f}.md")
    txt_sections = get_sections(txt)
    typ_s = structure_record(txt_sections)
    parsed = create_admontions(typ_s)
    # Use a distinct name for the file handle so it does not shadow the loop variable `f`
    with open(f"__{f}.md", "w") as outfile:
        myst_txt = "\n\n".join([t[1] for t in parsed ])
        myst_txt = generate_figure(myst_txt)
        outfile.write(myst_txt)

In [65]:
f

'life_aside_2'