# Soirbhiochas Demo

For this demo to work, you will need three things: the Irish Universal Dependencies treebank (`ga_idt-ud-train.conllu`); a list of common Irish prefixes (WARNING: VERY ROUGH) from `github.com/philtweir/wikt-irish-prefixes`; and the BuNaMo corpus (`github.com/michmech/BuNaMo`), which by default should be placed in `./data`. If you have any difficulty finding these, check the `build.sh` script.

In [1]:
from soirbhiochas import staidreamh
from soirbhiochas import visualization
from soirbhiochas.leabharlann import * # simplifies rule-building

To address some perceived minor issues, this version changes some of the regex matching in `Opers.Slenderize`; the changes are marked with a `v2` flag.


In [2]:
from soirbhiochas.díolaim import Díolaim, CorrectionDict
from soirbhiochas.parsáil import Lexicon

In [3]:
CONLLU = "ga_idt-ud-train.conllu"
TYPOS = "suspected_typos.txt"

In [4]:
corrections: CorrectionDict = {}

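with open(TYPOS, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()  # drop the trailing newline so to_upos is clean
        if not line:
            continue  # skip blank lines
        from_form, from_upos, to_form, to_upos = line.split(',')
        # keyed by (UPOS, form), mapping to the corrected (UPOS, form)
        corrections[(from_upos, from_form)] = (to_upos, to_form)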
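
The loop above implies that each line of `suspected_typos.txt` holds four comma-separated fields: original form, original UPOS, corrected form, corrected UPOS. The sketch below shows that parsing step in isolation. The helper name `parse_corrections` and the sample rows are made up for illustration and are not taken from the real file.

```python
import io
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (UPOS, form)


def parse_corrections(lines: Iterable[str]) -> Dict[Key, Key]:
    """Build a {(from_upos, from_form): (to_upos, to_form)} lookup table."""
    table: Dict[Key, Key] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        from_form, from_upos, to_form, to_upos = line.split(",")
        table[(from_upos, from_form)] = (to_upos, to_form)
    return table


# Placeholder rows (hypothetical, not real treebank corrections)
sample = io.StringIO("teh,NOUN,the,DET\n\nfoo,X,bar,NOUN\n")
table = parse_corrections(sample)
print(table[("NOUN", "teh")])
```

Keying on the `(UPOS, form)` pair means a correction applies only when both the spelling and the part-of-speech tag match.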

## Rule-building

Here we build Caol le Caol using simpler rules. The implementation of these is also fairly basic, but it's easiest to pull them in from `soirbhiochas.leabharlann`.

In [28]:
class CaolLeCaol(Riail):
    gairid = "Caol le caol"
    prefix = "clc"
    fada = "Caol le caol, leathan le leathan"
    béarla = "Slender with slender, broad with broad"
    míniú = "Caithfidh go haontaíonn na gutaí ar dhá thaobh consain"
    soláithraíonn = ("pointí_teipe",)

    def tástáladh(self, focal: FocalGinearalta, aschuir: dict) -> bool:
        aschuir["pointí_teipe"] = []
        focal = focal.lower()
        consain_blocks = get_consonant_blocks(focal)
        for s, f in consain_blocks:
            if s > 0 and f < len(focal) - 1:
                g1 = focal[s - 1]
                g2 = focal[f + 1]
                assert g1 in GUTAÍ
                assert g2 in GUTAÍ

                if g1 in GUTAÍ_CAOL and g2 in GUTAÍ_LEATHAN \
                        or g1 in GUTAÍ_LEATHAN and g2 in GUTAÍ_CAOL:
                    aschuir["pointí_teipe"].append((s, f))
        return not bool(aschuir["pointí_teipe"])
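
As a rough, self-contained illustration of the check in `tástáladh`, the sketch below re-implements it without the library. The vowel sets and the helpers `consonant_blocks` and `failure_points` are assumptions made for this example; the real `GUTAÍ_CAOL`, `GUTAÍ_LEATHAN` and `get_consonant_blocks` may behave differently, for instance around hyphens.

```python
import re
from typing import List, Tuple

# Assumed vowel sets for this sketch only
SLENDER = set("eiéí")
BROAD = set("aouáóú")
VOWELS = SLENDER | BROAD

# A "block" here is any maximal run of non-vowel characters
BLOCK_RE = re.compile(f"[^{''.join(sorted(VOWELS))}]+")


def consonant_blocks(word: str) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) indices of each non-vowel run."""
    return [(m.start(), m.end() - 1) for m in BLOCK_RE.finditer(word)]


def failure_points(word: str) -> List[Tuple[int, int]]:
    """Return the blocks whose flanking vowels disagree (slender vs broad)."""
    word = word.lower()
    points = []
    for start, end in consonant_blocks(word):
        # Only blocks with a vowel on both sides can break the rule
        if start > 0 and end < len(word) - 1:
            before, after = word[start - 1], word[end + 1]
            if (before in SLENDER) != (after in SLENDER):
                points.append((start, end))
    return points


for w in ("cailín", "anseo", "féadfaidh", "féadfidh"):
    print(w, failure_points(w))
```

A word passes when `failure_points` returns an empty list, which mirrors the `return not bool(...)` at the end of `tástáladh`.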

In [31]:
CAOL_LE_CAOL = (CaolLeCaol()
    # Exceptions
    .eisceacht_a_dhéanamh(
        # Contractions, if the expanded form passes (we put this here to ensure
        # they are not labelled as compound words, e.g. anseo vs anbhás)
        IsContraction() & expanded_form_passes,
        "...ach amháin 'ansin' agus 'anseo'"
    )
    .eisceacht_a_dhéanamh(
        # Compound words, if the breakpoints are where CLC fails
        FocailChumaisc() & is_breakpoint_in_failure_area,
        "...agus roinnt focail chumaisc"
    )
    .eisceacht_a_dhéanamh(
        # Is a preposition
        is_a_preposition,
        "...agus roinnt réamhfhocail"
    )
    .eisceacht_a_dhéanamh(
        # Where a slender e is really a broad ae
        a_is_ae,
        "...agus nuair a bhíonn 'ae' leathan i ndáiríre"
    )
    .eisceacht_a_dhéanamh(
        # Just one of those words that just begins with an A
        # Mainly: arís, aréir, aniar, adeir, ...
        begins_with_a,
        "...agus roinnt dobhriathra a thosaíonn le 'a'"
    )
    .eisceacht_a_dhéanamh(
        # This is a loanword
        is_foreign,
        "...agus focail iasachta"
    )
    .eisceacht_a_dhéanamh(
        # There are enough examples to suggest this is the one
        # true exception - féadfaidh exists, but so does féadfidh
        # in a range of sources
        RiailIs("féadfidh"),
        "...agus 'féadfidh'"
    )
)
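
To make the chaining pattern above easier to follow, here is a minimal sketch of how a rule with ordered exceptions and `&`-combinable predicates could be structured. `Predicate`, `SketchRule`, `add_exception` and `classify` are hypothetical names for this example, and the first-match-wins ordering is an assumption; none of this is the actual `Riail` implementation.

```python
from typing import Callable, List, Tuple


class Predicate:
    """Wraps a word -> bool function so two predicates can be joined with `&`."""

    def __init__(self, fn: Callable[[str], bool]):
        self.fn = fn

    def __call__(self, word: str) -> bool:
        return self.fn(word)

    def __and__(self, other: "Predicate") -> "Predicate":
        # Both predicates must hold for the combined one to hold
        return Predicate(lambda word: self(word) and other(word))


class SketchRule:
    def __init__(self, test: Callable[[str], bool]):
        self.test = test
        self.exceptions: List[Tuple[Predicate, str]] = []

    def add_exception(self, predicate: Predicate, description: str) -> "SketchRule":
        self.exceptions.append((predicate, description))
        return self  # returning self is what allows the calls to be chained

    def classify(self, word: str) -> str:
        if self.test(word):
            return "passes"
        # Assumption: the first matching exception claims the word
        for predicate, description in self.exceptions:
            if predicate(word):
                return description
        return "unexplained"


# Toy stand-ins, not the library's predicates
FAILING = {"anseo", "arís", "féadfidh"}
is_contraction = Predicate(lambda w: w in {"anseo", "ansin"})
is_short = Predicate(lambda w: len(w) <= 5)
begins_with_a = Predicate(lambda w: w.startswith("a"))

rule = (SketchRule(lambda w: w not in FAILING)  # toy base test
    .add_exception(is_contraction & is_short, "contraction")
    .add_exception(begins_with_a, "begins with a")
)

for w in ("cailín", "anseo", "arís", "féadfidh"):
    print(w, "->", rule.classify(w))
```

Under that assumption, registration order matters: `anseo` also begins with "a", but it is claimed by the contraction exception because that one was registered first.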

## Dictionary data

Next we load dictionaries and third-party data.

In [32]:
staidreamh.add_loadable("prefixes", "wikt-irish-prefixes/wikt-irish-prefixes.txt")

lexicon = Lexicon()
lexicon.load()
staidreamh.add_loadable("lexicon", lexicon)

corpas = Díolaim.cruthaíodh_as_comhad(CONLLU, lexicon.find_by_token, corrections=corrections)

## Executing the count

Finally, we can run the rule and its exceptions against the corpus.

In [33]:
def only_for_normal_words(focal):
    return (focal.upos not in ('PROPN', 'SYM', 'X'))

In [35]:
CAOL_LE_CAOL.set_sample_size(5)
counter, statistics = staidreamh.count_rule_by_word(CAOL_LE_CAOL, corpas.de_réir_focal(), only_for_normal_words)
print(f"Found {len(counter)} exceptions")
most_common = counter.most_common(150)
# Lay the words out three per row; zip() stops at the shortest column,
# so an incomplete final row is not printed
for row in zip(*[most_common[i::3] for i in range(3)]):
    # Append the count only when a word occurs more than once
    row = [r[0] if r[1] == 1 else f"{r[0]} {r[1]}" for r in row]
    print(f"{row[0]: >20} {row[1]: >20} {row[2]: >20}")

counts = CAOL_LE_CAOL.get_counts()
staidreamh.print_count(counts, indent=2)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 95881/95881 [00:27<00:00, 3444.93it/s]

Found 122 exceptions
   t-uisceshaothrú 3            shaolré 2               insa 2
          istoíche 2       dlúthdhiosca 2         ghéarchéim 2
       breisluacha 2            neamhní 2    ranníocaíochtaí 2
     uisceshaothrú 2       fhadtéarmach 2          eischósta 2
               agena        Bordcheantair        Bhordcheantar
            tsecoind              magenta           tonnchrith
   pobalbhreitheanna fadbhreathnaitheacha         cillscannáin
           suimiúila            phíchairt           lánchreach
luath-athbheochantóirí     lúthchleasaíocht             ghiorria
           bánéadach                Dhera      meitéareolaithe
            pubcheol       consighneachta        dtoghcheantar
        dlíthairgthe             déanfidh            dheonfidh
             ndintar               dintar            shaorfidh
     gairmoideachais              lánseol            fáthmheas
           dimhúinte            tuinairde      thonnchreathach
      dlúthcheirníní         pob




Note that the words listed above are the failures that were not caught by any registered exception to the top-level rule (Caol le caol).
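
One detail of the three-column listing: because it is built with `zip`, a final row with fewer than three words is silently dropped. The sketch below shows one possible variant using `itertools.zip_longest` that keeps the partial row. The function `format_rows` and the sample counter are hypothetical and use placeholder data, not the corpus results.

```python
from collections import Counter
from itertools import zip_longest
from typing import List


def format_rows(counter: Counter, limit: int = 150, columns: int = 3, width: int = 20) -> List[str]:
    """Format the most common entries in right-aligned columns, keeping partial rows."""
    items = counter.most_common(limit)
    # Show the count only when a word occurs more than once
    labels = [word if n == 1 else f"{word} {n}" for word, n in items]
    # labels[i::columns] is the i-th column of the layout
    chunks = [labels[i::columns] for i in range(columns)]
    lines = []
    for row in zip_longest(*chunks, fillvalue=""):
        lines.append(" ".join(f"{cell: >{width}}" for cell in row).rstrip())
    return lines


# Placeholder data for illustration only
sample = Counter({"a": 3, "b": 2, "c": 1, "d": 1})
lines = format_rows(sample, width=5)
for line in lines:
    print(line)
```

With four entries and three columns this prints two rows, the second containing only the leftover entry.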

## Visualization

We now use Altair to render an interactive diagram illustrating these counts.

In [36]:
ad = visualization.counts_to_vegalite(CAOL_LE_CAOL.fada, counts)

In [37]:
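from altair import vegalite

# Replace the display object's validation hook with a no-op so the
# generated spec is rendered without being validated first
vegalite.display.VegaLite._validate = lambda self: True
vegalite.display.VegaLite(ad)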
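
For readers unfamiliar with what is being displayed, a Vega-Lite chart is described by a plain JSON-style dictionary. The sketch below builds a simple bar-chart spec from a flat mapping of labels to counts. The function `bar_chart_spec`, the flat shape of the input and the sample numbers are all assumptions for illustration; the spec actually produced by `visualization.counts_to_vegalite` is likely to be structured differently.

```python
import json
from typing import Dict


def bar_chart_spec(title: str, counts: Dict[str, int]) -> dict:
    """Return a minimal Vega-Lite bar-chart spec for label -> count data."""
    values = [{"label": label, "count": n} for label, n in counts.items()]
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "title": title,
        "data": {"values": values},
        "mark": "bar",
        "encoding": {
            # Sort bars by descending count
            "x": {"field": "label", "type": "nominal", "sort": "-y"},
            "y": {"field": "count", "type": "quantitative"},
            "tooltip": [
                {"field": "label", "type": "nominal"},
                {"field": "count", "type": "quantitative"},
            ],
        },
    }


# Made-up numbers, not results from the corpus
spec = bar_chart_spec("Caol le caol (sketch)", {"passes": 10, "exception": 3, "unexplained": 1})
print(json.dumps(spec, indent=2, ensure_ascii=False))
```

Because the spec is ordinary data, it can be inspected or saved as JSON before being handed to a renderer.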