In [1]:
SECT57 = """
{{section|s57|§ 57.}} The O.Ir. prefix variously spelt ir-, er-, aur- (now written ur‑) is pro&shy;nounced ''o̤r''. The common spelling with au was probably intended to denote some sound like ''o̤'', cp. O’Donovan, Grammar p.&nbsp;17. Medieval scribes seem to have been at a loss to represent this sound. The frequent appear&shy;ance of e for ''o̤'', cp. terus = turas RC. vii 296, terad for turud Wi. p.&nbsp;818, finds a parallel in the inter&shy;change of ''o̤'' and ï in Donegal, cp. §&nbsp;{{QDD|103}}. Examples: ''o̤rəχəsk'', ‘injection’, Di. urchosc; ''o̤rəχɔdʹ'', ‘harm’, M.Ir. erchoit, irchoit; ''o̤rəχər'', ‘shot’, M.Ir. erchor, aurchor, irchor, urchor; ''o̤rLαr'', ‘floor’, Wi. orlar; ''o̤rNỹ꞉'', ‘prayer’, M.Ir. ernaigthe, airnaig&shy;the; ''o̤rχəL'', ‘cricket’, Di. urchuil; ''o̤rsə'', ‘jamb’, M.Ir. irsa, ursa; ''o̤rLə'', ‘eaves, fringe’, M.Ir. urla; ''o̤rNʹæʃ'', ‘furniture’, Meyer airnéis; ''o̤rLuw'', ‘speech, eloquence’, O.Ir. erlabra, aurlabra (see §&nbsp;{{QDD|444}}). Note ''ɔ꞉rLə'', ‘vomit’, Di. orlughcan, urlacan with ''ɔ꞉'', *''o̤rbəL'', ‘tail’, M.Ir. erball has become ''ro̤bəL'' as elsewhere.
"""

In [31]:
DATA = {}

In [34]:
import re

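def simple_extract(text):
    """Parse a '§ N. item; item; ...' string and store its items in DATA[N]."""
    text = text.strip()
    # Drop a single trailing full stop (endswith is safe on an empty string).
    if text.endswith("."):
        text = text[:-1]
    # Raw string, so that \. is a regex escape rather than a string escape.
    m = re.match(r"^§ ([0-9]+)\.", text)
    if not m:
        return None
    section = int(m.group(1))
    text = text[len(m.group(0)):]
    entries = []  # renamed from `all`, which shadowed the builtin
    for count, item in enumerate((x.strip() for x in text.split(";")), start=1):
        current = {
            "section": section,
            "id": f"{section}_{count}"
        }
        # Expect "transcription, ‘gloss’"; other items keep only section and id.
        m = re.match("^([^,]+), ‘([^’]+)’$", item)
        if m:
            current["transcription"] = m.group(1)
            current["english"] = m.group(2)
        entries.append(current)
    DATA[section] = entries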


In [33]:
DATA

{490: [{'section': 490,
   'id': '490_1',
   'transcription': 'glαk gə ·sɔkyr′ ə',
   'english': 'take it easy'},
  {'section': 490,
   'id': '490_2',
   'transcription': 'Nα kyr′ kɔ ·t′Uw̥ iəd',
   'english': 'do not set them so close'}]}

In [6]:
section = 1

In [17]:
PAGE = """
§ 3. This sound frequently represents O.Ir. a in accented syllables before non-palatal con­sonants, e.g. αrəm, ‘army’, O.Ir. arm; αt, ‘swelling’, O.Ir. att; fαnαχt ‘to stay, remain’, O.Ir. anaim; kαpəL, ‘mare’, M.Ir. capall; mαk, ‘son’, O.Ir. macc; mαLαχt, ‘curse’, O.Ir. maldacht; tαχtuw, ‘to choke’, O.Ir. tachtad; tαrt, ‘thirst’, O.Ir. tart; tαruw, ‘bull’, M.Ir. tarb.

§ 4. O.Ir. e before non-palatal con­sonants in accented syllables usually gives α, e.g. αχ, ‘steed’, O.Ir. ech; αlə, ‘swan’, M.Ir. ela; αŋ, ‘splice, strip’; αŋαχ, ‘fisherman’s net’, M.Ir. eng; dʹrʹαm, ‘crowd’, M.Ir. dremm; dʹαrəg, ‘red’, O.Ir. derg; fʹαr, ‘man’, O.Ir. fer; gʹαl, ‘white’, M.Ir. gel; kʹαχtər, ‘either’, O.Ir. cechtar; Lʹαnuw, ‘child’, M.Ir. lenab; Nʹαd, ‘nest’, M.Ir. net; pʹαkuw, ‘sin’, O.Ir. peccad; ʃαsuw, ‘to stand’, M.Ir. sessom; tʹαχ, ‘house’, O.Ir. tech.
"""
PAGE_NUM = 5

In [18]:
page_lines = [x for x in PAGE.replace("\u00ad", "").split("\n") if x != ""]

In [21]:
import re

_BASIC = r"^([^‘]+) ‘([^’]+)’"
BASIC = re.compile(_BASIC)

In [20]:
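for line in page_lines:
    counter = 1  # example ids restart in every section
    if line.startswith("§ "):
        dot = line.find(". ")
        pn = line[2:dot]
        try:
            section = int(pn)
        except ValueError:
            # Not a numbered section heading; skip the line.
            continue
        current = {}
        if "e.g." in line:
            linep = line.split("e.g.")
            if len(linep) != 2:
                # More than one "e.g." marker: flag the line for manual review.
                print(line)
            parts = [x.strip() for x in linep[1].split(";")]
            for part in parts:
                if part.endswith("."):
                    part = part[:-1]
                # Echo each example with its section and page,
                # as in the output shown below.
                print(part, section, PAGE_NUM)
                m = BASIC.match(part)
                if m:
                    current = {
                        "page": PAGE_NUM,
                        "section": section,
                        "id": f"{section}_{counter}",
                        # BASIC captures the comma before the gloss; strip it.
                        "transcription": m.group(1).rstrip(","),
                        "english": m.group(2)
                    }
                # Advance once per example so that every id is unique.
                counter += 1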


αrəm, ‘army’, O.Ir. arm 3 5
αt, ‘swelling’, O.Ir. att 3 5
fαnαχt ‘to stay, remain’, O.Ir. anaim 3 5
kαpəL, ‘mare’, M.Ir. capall 3 5
mαk, ‘son’, O.Ir. macc 3 5
mαLαχt, ‘curse’, O.Ir. maldacht 3 5
tαχtuw, ‘to choke’, O.Ir. tachtad 3 5
tαrt, ‘thirst’, O.Ir. tart 3 5
tαruw, ‘bull’, M.Ir. tarb 3 5
αχ, ‘steed’, O.Ir. ech 4 5
αlə, ‘swan’, M.Ir. ela 4 5
αŋ, ‘splice, strip’ 4 5
αŋαχ, ‘fisherman’s net’, M.Ir. eng 4 5
dʹrʹαm, ‘crowd’, M.Ir. dremm 4 5
dʹαrəg, ‘red’, O.Ir. derg 4 5
fʹαr, ‘man’, O.Ir. fer 4 5
gʹαl, ‘white’, M.Ir. gel 4 5
kʹαχtər, ‘either’, O.Ir. cechtar 4 5
Lʹαnuw, ‘child’, M.Ir. lenab 4 5
Nʹαd, ‘nest’, M.Ir. net 4 5
pʹαkuw, ‘sin’, O.Ir. peccad 4 5
ʃαsuw, ‘to stand’, M.Ir. sessom 4 5
tʹαχ, ‘house’, O.Ir. tech 4 5
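
Each printed entry has up to three parts: a transcription, a quoted gloss, and an optional historical form such as `O.Ir. arm`. The cells above keep only the first two. The following is a minimal sketch of a pattern that captures all three. It assumes the gloss is closed by a `’` that is followed by a comma or the end of the entry, so an apostrophe inside the gloss (`fisherman’s net`) does not end it early. `ENTRY` and `parse_entry` are hypothetical names.

```python
import re

# transcription [,] ‘gloss’ [, historical form]
ENTRY = re.compile(
    r"^(?P<transcription>[^‘]+?),?\s*"
    r"‘(?P<english>.+?)’"
    r"(?:,\s*(?P<source>.+))?$"
)

def parse_entry(part):
    """Return a dict with transcription, english and source, or None."""
    m = ENTRY.match(part.strip())
    if m is None:
        return None
    # groupdict() sets "source" to None when the optional group is absent
    return m.groupdict()

for sample in [
    "αrəm, ‘army’, O.Ir. arm",
    "fαnαχt ‘to stay, remain’, O.Ir. anaim",
    "αŋαχ, ‘fisherman’s net’, M.Ir. eng",
    "αŋ, ‘splice, strip’",
]:
    print(parse_entry(sample))
```

The comma after the transcription is optional in the pattern because `fαnαχt ‘to stay, remain’` lacks one in the source text.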


In [5]:
page_lines

['§ 3. This sound frequently represents O.Ir. a in accented syllables before non-palatal consonants, e.g. αrəm, ‘army’, O.Ir. arm; αt, ‘swelling’, O.Ir. att; fαnαχt ‘to stay, remain’, O.Ir. anaim; kαpəL, ‘mare’, M.Ir. capall; mαk, ‘son’, O.Ir. macc; mαLαχt, ‘curse’, O.Ir. maldacht; tαχtuw, ‘to choke’, O.Ir. tachtad; tαrt, ‘thirst’, O.Ir. tart; tαruw, ‘bull’, M.Ir. tarb.',
 '§ 4. O.Ir. e before non-palatal consonants in accented syllables usually gives α, e.g. αχ, ‘steed’, O.Ir. ech; αlə, ‘swan’, M.Ir. ela; αŋ, ‘splice, strip’, αŋαχ, ‘fisherman’s net’, M.Ir. eng; dʹrʹαm, ‘crowd’, M.Ir. dremm; dʹαrəg, ‘red’, O.Ir. derg; fʹαr, ‘man’, O.Ir. fer; gʹαl, ‘white’, M.Ir. gel; kʹαχtər, ‘either’, O.Ir. cechtar; Lʹαnuw, ‘child’, M.Ir. lenab; Nʹαd, ‘nest’, M.Ir. net; pʹαkuw, ‘sin’, O.Ir. peccad; ʃαsuw, ‘to stand’, M.Ir. sessom; tʹαχ, ‘house’, O.Ir. tech.']
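
The loop above only recognises sections that introduce their examples with `e.g.`, whereas § 57 uses `Examples:`. Below is a minimal sketch of a splitter that accepts either marker. `split_section` and both pattern names are hypothetical, and any prose after the last example (such as the `Note …` sentence in § 57) would still end up in the final item.

```python
import re

SECTION_RE = re.compile(r"^§\s*(\d+)\.\s*(.*)$", re.DOTALL)
MARKER_RE = re.compile(r"\be\.g\.|\bExamples:")

def split_section(line):
    """Return (section number, prose, list of raw examples), or None."""
    m = SECTION_RE.match(line.strip())
    if m is None:
        return None
    number, body = int(m.group(1)), m.group(2)
    marker = MARKER_RE.search(body)
    if marker is None:
        # No example marker: the whole body is prose
        return number, body, []
    prose = body[:marker.start()].rstrip(" ,")
    # Drop the full stop that closes the section, then split on ";"
    examples = body[marker.end():].strip().rstrip(".")
    return number, prose, [e.strip() for e in examples.split(";") if e.strip()]
```

Splitting on `;` alone still relies on the source punctuation being consistent between examples.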