# Soirbhíochas Gender Demo

A common challenge for students learning Irish is identifying a noun's gender. Any set of rules for doing so has many exceptions, but we try one set below to see how it fares against a representative set of pre-analysed sentences.

It is important to note that Soirbhíochas is a tool for showing how a rule and its exceptions combine to cover most or all circumstances. This is subtly different from seeing how adding each exception _changes_ the outcome (a nice-to-have for the future). The simplest way to put it is that you _need to be able to know that an exception only applies to cases that break the rule_. Hence we assume you know the gender and say "all words are masculine, unless they are feminine and slender..." instead of starting from not knowing and saying "assume a word is masculine, unless it is slender". If this doesn't make sense, don't worry, but bear in mind that the graph below will not tell you how effective it is to assume that slender words are feminine and non-slender words are masculine. If you think you can read that off it, you're reading it wrong :) (but you could adjust the rules a bit to find out).
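
To make the distinction concrete, here is a minimal sketch using a hypothetical, hand-labelled list of six words. It does not use the Soirbhíochas API; the names `SAMPLE`, `covered` and `guess` are made up for illustration. It contrasts "coverage" (gender known, exception only applied to rule-breakers) with the accuracy of guessing gender from slenderness alone.

```python
# Hypothetical hand-labelled sample: (word, gender, has_slender_ending)
SAMPLE = [
    ("bord", "m", False),
    ("fear", "m", False),
    ("cailín", "m", True),    # masculine despite its slender ending
    ("áit", "f", True),
    ("fuinneog", "f", False),
    ("bean", "f", False),
]

def covered(gender, slender):
    """'All words are masculine, unless they are feminine and slender.'"""
    return gender == "m" or (gender == "f" and slender)

def guess(slender):
    """'Assume a word is masculine, unless it is slender.'"""
    return "f" if slender else "m"

coverage = sum(covered(g, s) for _, g, s in SAMPLE)
accuracy = sum(guess(s) == g for _, g, s in SAMPLE)

print(f"covered by rule + exception: {coverage}/{len(SAMPLE)}")
print(f"correctly guessed from slenderness: {accuracy}/{len(SAMPLE)}")
```

The two numbers differ because of `cailín`: it satisfies the top-level rule when the gender is known, but a learner guessing from the ending alone would get it wrong.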

The ruleset we use below is taken from [Wikibooks](https://en.wikibooks.org/wiki/Irish/Reference/Nouns) but it is a common sequence. The order is changed a little, because assuming that a country or language is feminine is more accurate (though not perfect) than checking whether a word is "feminine _and_ slender" when you don't know the gender in the first place.

A significant side point is that most of these gender-guessing rules assume you know the root of the word (and therefore also whether it is in the root form). For instance, the rules work for `Meáinmhuir (f)` (Mediterranean) but not `Meáinmhara` (the same word in the genitive). While finding the lemma (base form) is another stumbling block, it is sometimes easier to guess than the gender, so try comparing the results with `WITH_LEMMA` on (when you see a word, you try its lemma against the rules) and off (you apply the rules to the word itself).
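
As a rough illustration of why the lemma matters, the sketch below applies a deliberately simplified slender-ending check (the last vowel is e, i, é or í) to both forms. This is an assumption-laden stand-in, not GramadánPy's `Opers.IsSlenderEnding`, and the token dictionary is hypothetical.

```python
SLENDER_VOWELS = set("eiéí")
ALL_VOWELS = set("aeiouáéíóú")

def last_vowel_is_slender(word):
    """Simplified check: is the final vowel of the word a slender one?"""
    for ch in reversed(word.lower()):
        if ch in ALL_VOWELS:
            return ch in SLENDER_VOWELS
    return False  # no vowel found at all

# Hypothetical token: the surface form seen in a sentence plus its lemma
token = {"form": "Meáinmhara", "lemma": "Meáinmhuir"}

def pick(token, with_lemma):
    """Choose which string the rule gets to see."""
    return token["lemma"] if with_lemma else token["form"]

for with_lemma in (True, False):
    word = pick(token, with_lemma)
    print(with_lemma, word, last_vowel_is_slender(word))
```

With the lemma, the word looks slender and so can be caught by a "feminine with a slender ending" exception; with the genitive form it cannot.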

For this demo to work, you will need a copy of the Irish Universal Dependencies treebank (`ga_idt-ud-train.conllu`), a copy of common prefixes in Irish (WARNING: VERY ROUGH) from `github.com/philtweir/wikt-irish-prefixes` and the BuNaMo corpus, which should by default be placed in `./data` (`github.com/michmech/BuNaMo`). If you have any difficulty finding these, check the `build.sh` script.

In [1]:
from soirbhiochas import staidreamh
from soirbhiochas import visualization
from soirbhiochas.leabharlann import * # simplifies rule-building

Note: to address some perceived minor issues, the code used here changes some of the regex matching in `Opers.Slenderize`; these changes are marked with a v2 flag.


In [2]:
from soirbhiochas.díolaim import Díolaim, CorrectionDict
from soirbhiochas.parsáil import Lexicon

In [3]:
CONLLU = "ga_idt-ud-train.conllu"
TYPOS = "suspected_typos.txt"

In [4]:
corrections: CorrectionDict = {}

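# Each line is expected to be: from_form,from_upos,to_form,to_upos
with open(TYPOS, 'r', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines, which would otherwise fail to unpack
        from_form, from_upos, to_form, to_upos = line.strip().split(',')
        corrections[(from_upos, from_form)] = (to_upos, to_form)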

In [5]:
lexicon = Lexicon()
lexicon.load()

## A choice

Firstly, we decide whether we want to assume we can guess a word's lemma correctly - most of the "which gender is this noun?" strategies assume you can.

In [6]:
WITH_LEMMA = True

def m(focal):
    return focal.lemma if WITH_LEMMA else (focal)

## Rule-building

Here we build our gender-guesser using a series of rules.

In [7]:
from gramadan.features import Gender
from gramadan.v2.entity import Entity
from gramadan.v2.opers import Opers
from gramadan.v2.noun import Noun

from soirbhiochas.rialacha import Riail
from soirbhiochas.parsáil import FocalGinearalta

class GlacFirinscneach(Riail):
    gairid = "Glac firinscneach"
    prefix = "gf"
    fada = "Glac leis go bhfuil gach focal firinscneach"
    béarla = "Assume every word is masculine"
    míniú = "Tá formhór na bhfocal firinscneach, mar sin glac leis go bhfuil siad uile"
    soláithraíonn = ()

    def tástáladh(self, focal: FocalGinearalta) -> bool:
        return focal.focal.gender == Gender.Masc

The first is very simple: we "pass" the check if the word is masculine - i.e. the rule is that all words are masculine (most of the code above is helptext).

In [8]:
def deireadh_caol(focal: FocalGinearalta) -> bool:
    return Opers.IsSlenderEnding(m(focal))

The next is not much more complex - this will pass if the word ends in a slender consonant or vowel.

In [9]:
import re

KNOWN_FEMININE_ENDINGS = [
    "eog",
    "óg",
    "lann"
]
re_fe = re.compile(f"({'|'.join(KNOWN_FEMININE_ENDINGS)})$")

def has_a_known_feminine_ending(focal: FocalGinearalta) -> bool:
    return (re_fe.search(m(focal)) is not None)

This one uses a regular expression to see if the word ends in one of the listed endings.
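
A standalone sketch of the same pattern, run on a few hand-picked words, shows the effect of the `$` anchor: the ending has to be at the end of the word, so `ógánach` (which merely starts with `óg`) does not match. The word list is only an illustration.

```python
import re

KNOWN_FEMININE_ENDINGS = ["eog", "óg", "lann"]
pattern = re.compile(f"({'|'.join(KNOWN_FEMININE_ENDINGS)})$")

# Hand-picked examples; the last two should not match
words = ["fuinneog", "bábóg", "leabharlann", "bord", "ógánach"]
matches = {word: pattern.search(word) is not None for word in words}

for word, matched in matches.items():
    print(f"{word: >12} {matched}")
```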

In [10]:
def is_a_multisyllable_word_ending_in_acht_or_íocht(focal: FocalGinearalta) -> bool:
    return (
        (re.search("(acht|íocht)$", m(focal)) is not None) and
        Opers.PolysyllabicV2(m(focal))
    )

Here, we use a little more of GramadánPy's tooling to pick out polysyllabic words ending in certain endings.
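
To show why the polysyllable condition is there, here is a rough sketch that approximates syllables by counting runs of vowels. This is an assumption made for illustration and may not agree with `Opers.PolysyllabicV2` in every case; the helper names are hypothetical.

```python
import re

VOWEL_RUN = re.compile(r"[aeiouáéíóú]+", re.IGNORECASE)

def rough_syllable_count(word):
    """Approximate syllables as the number of separate vowel runs."""
    return len(VOWEL_RUN.findall(word))

def polysyllabic_acht_or_iocht(word):
    """Ends in -acht/-íocht and has more than one (approximate) syllable."""
    return (
        re.search(r"(acht|íocht)$", word) is not None
        and rough_syllable_count(word) > 1
    )

for word in ["Gaeltacht", "filíocht", "ceacht", "acht", "bord"]:
    print(f"{word: >10} {rough_syllable_count(word)} {polysyllabic_acht_or_iocht(word)}")
```

Single-syllable words such as `ceacht` have the right ending but are excluded by the syllable check, which is the point of combining the two conditions.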

In [11]:
relevant_fourth_declension_feminine_words = set()
def is_fourth_declension_feminine(focal: FocalGinearalta) -> bool:
    if m(focal)[-1] not in Opers.VowelsSlender or focal < "PROPN":
        return False
    
    # We record this simply so you can see what they are later
    if focal.focal.gender == Gender.Fem:
        relevant_fourth_declension_feminine_words.add(focal.focal.getLemma())
    
    return focal.focal.declension == 4

Next, we make a rule that checks if a noun is fourth declension and ends in e or i - this probably seems more complicated than knowing its gender! However, we know this will only be run when the word is feminine, so it is a reasonable (but far from perfect) approximation for "abstract nouns" ending in a slender vowel (e.g. comhairle or aiste).

In [12]:
from soirbhiochas.collections import COUNTRIES, LANGUAGES

def is_a_country(focal: FocalGinearalta):
    return focal < "PROPN" and focal in COUNTRIES

def is_a_language(focal: FocalGinearalta):
    return focal < "PROPN" and focal in LANGUAGES

Lastly, we use two collections - one for countries and one for languages (bear in mind that both are imperfect; we're playing the odds over a large corpus). These are a little bit smart, so, for example, even if some countries always appear with or without a definite article, this should match either way (e.g. if you search for `an tSin` or `Sin`).
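
One way such article-insensitive matching could work is sketched below. This is purely an assumption for illustration, not how `soirbhiochas.collections` is implemented: `strip_article`, `is_country_demo` and `COUNTRIES_DEMO` are hypothetical names, and only the article and a lowercase `t` prefix are handled.

```python
COUNTRIES_DEMO = {"Seapáin", "Spáinn", "Éire"}  # hypothetical mini-collection

def strip_article(name):
    """Drop a leading 'an'/'na' and a lowercase t-prefix (tSeapáin -> Seapáin)."""
    words = name.split()
    if not words:
        return ""
    if len(words) > 1 and words[0].lower() in ("an", "na"):
        words = words[1:]
    first = words[0]
    if len(first) > 1 and first[0] == "t" and first[1].isupper():
        first = first[1:]
    return " ".join([first] + words[1:])

def is_country_demo(name):
    return strip_article(name) in COUNTRIES_DEMO

for name in ["an tSeapáin", "Seapáin", "An Spáinn", "Éire", "bord"]:
    print(f"{name: >12} {is_country_demo(name)}")
```

Lenited forms (e.g. after some prepositions) are not handled here; a fuller version would need to undo those mutations as well.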

In [13]:
PIOC_INSCNE = (GlacFirinscneach()
    .eisceacht_a_dhéanamh(
        is_a_language,
        "...or if it is a feminine language"
    )
    .eisceacht_a_dhéanamh(
        is_a_country,
        "...or if it is a feminine country"
    )
    .eisceacht_a_dhéanamh(
        is_fourth_declension_feminine,
        "...or it is (roughly) a feminine abstract noun ending in e/i"
    )
    .eisceacht_a_dhéanamh(
        is_a_multisyllable_word_ending_in_acht_or_íocht,
        "...or if it is a feminine multisyllable word with known ending"
    )
    .eisceacht_a_dhéanamh(
        has_a_known_feminine_ending,
        "...or if it is a feminine word with another standard feminine ending"
    )
    .eisceacht_a_dhéanamh(
        deireadh_caol,
        "...or if it is feminine with a slender ending"
    )
)

Now, we put it all together. We explicitly note that the exceptions all match only feminine words. All that remains is to load the corpus and run it.
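
The chaining pattern can be sketched with a toy class. `ToyRule` is hypothetical, and it assumes that exceptions are tried in order with the first match winning, which may not be exactly how `Riail` behaves. Because the exceptions are only consulted when the top-level rule fails, they only ever see non-masculine words.

```python
from collections import Counter

SLENDER = set("eiéí")
VOWELS = set("aeiouáéíóú")

def slender_ending(lemma):
    """Simplified check: is the last vowel slender?"""
    for ch in reversed(lemma.lower()):
        if ch in VOWELS:
            return ch in SLENDER
    return False

class ToyRule:
    def __init__(self, label, test):
        self.label, self.test, self.exceptions = label, test, []

    def make_exception(self, test, label):
        self.exceptions.append((label, test))
        return self  # returning self allows chaining

    def classify(self, word):
        if self.test(word):
            return self.label
        for label, test in self.exceptions:  # assumed: first match wins
            if test(word):
                return label
        return "uncaught"

toy = (ToyRule("masculine", lambda w: w["gender"] == "m")
    .make_exception(lambda w: w["lemma"].endswith(("eog", "óg", "lann")),
                    "known feminine ending")
    .make_exception(lambda w: slender_ending(w["lemma"]), "slender ending")
)

WORDS = [
    {"lemma": "bord", "gender": "m"},
    {"lemma": "fuinneog", "gender": "f"},
    {"lemma": "bábóg", "gender": "f"},
    {"lemma": "áit", "gender": "f"},
    {"lemma": "bean", "gender": "f"},
]
tally = Counter(toy.classify(w) for w in WORDS)
print(tally)
```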

## Dictionary data

Next we load dictionaries and third-party data.

In [14]:
staidreamh.add_loadable("prefixes", "wikt-irish-prefixes/wikt-irish-prefixes.txt")
staidreamh.add_loadable("countries", "wikt-irish-prefixes/countries-ga.txt")
staidreamh.add_loadable("languages", "wikt-irish-prefixes/wikt-languages.txt")
staidreamh.add_loadable("lexicon", lexicon)

corpas = Díolaim.cruthaíodh_as_comhad(CONLLU, lexicon.find_by_token, corrections=corrections)

In [15]:
SEEMINGLY_BROKEN = ("ann", "doh", "té", "sul", "(e)amar", "uile")
# I could be wrong, but the above appear as normal (not e.g. substantive or loaned) nouns
# at least once in corpus with insufficient information to use them, so we skip.

def only_nouns_with_known_gender(focal: FocalGinearalta):
    return (focal < "NOUN" or focal < "PROPN") and \
        focal.focal and focal.focal.gender and \
        focal.focal.getLemma() and \
        focal.token.form not in SEEMINGLY_BROKEN

We will want to filter the whole corpus to certain words - for instance, these rules cannot test adjectives, and we need to have gender and the lemma available to run the rules.


## Executing the count

Finally, we can run the rule and its exceptions against the corpus.

In [16]:
PIOC_INSCNE.set_sample_size(5)
counter, statistics = staidreamh.count_rule_by_word(PIOC_INSCNE, corpas.de_réir_focal(), only_nouns_with_known_gender)
print(f"Found {len(counter)} exceptions")
most_common = counter.most_common(150)
for row in zip(*[most_common[i::3] for i in range(3)]):
    row = [r[0] if r[1] == 1 else f"{r[0]} {r[1]}" for r in row]
    print(f"{row[0]: >20} {row[1]: >20} {row[2]: >20}")

counts = PIOC_INSCNE.get_counts()

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 95881/95881 [00:06<00:00, 15360.39it/s]

Found 366 exceptions
            uacht 67            teanga 56           theanga 24
             mian 23          foireann 22         fhoireann 20
           foirne 18              bean 13              lámh 13
          bheatha 12             scoth 11             bhean 10
            lámha 10         dteangacha 9              rogha 9
            sprioc 9            Feabhra 8          teangacha 8
            ríocht 8         athnuachan 8              leath 7
            Nuacht 7              láimh 7            iomarca 7
               mná 7         cuideachta 7               cuma 6
             chuma 6              leaba 6          fadhbanna 6
        spriocanna 6              eagla 5            Chraobh 5
      gCuideachtaí 5               mhná 5             chluas 5
          gCeathrú 5            dteanga 5               pian 4
             cluas 4              deoch 4              easpa 4
            nGaoth 4                ban 4             críche 4
              trua 4             b




Note that the exceptions printed at the end are those that have not been caught by any registered exception to the top-level rule (all words are masculine).
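
The three-column layout of that output comes from slicing the list with a stride of 3 and zipping the slices back together. A minimal sketch with made-up counts (not corpus data) shows the mechanics, including that `zip` silently drops a trailing partial row:

```python
from collections import Counter

# Made-up counts purely to show the layout logic
toy_counter = Counter({"a": 5, "b": 4, "c": 3, "d": 2, "e": 1, "f": 1, "g": 1})
most_common = toy_counter.most_common()

rows = []
for row in zip(*[most_common[i::3] for i in range(3)]):
    # Show the count only when a word occurs more than once
    labels = [word if count == 1 else f"{word} {count}" for word, count in row]
    rows.append(labels)
    print(f"{labels[0]: >8} {labels[1]: >8} {labels[2]: >8}")
```

Here the seventh entry (`g`) is not printed because it does not fill a complete row; asking for a multiple of three, as `most_common(150)` does, avoids that as long as at least that many entries exist.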

## Visualization

We now use Altair to get an interactive diagram illustrating these counts.

In [17]:
ad = visualization.counts_to_vegalite(PIOC_INSCNE.fada, counts)

In [18]:
from altair import vegalite

vegalite.display.VegaLite._validate = lambda self: True
vegalite.display.VegaLite(ad)

The fairly good news is that our rules cover 97% of the words tested in the corpus (counting repetitions). Unfortunately, to apply them the way they are written, you need to know the gender in the first place! In the future, Soirbhíochas could split out accuracy scores, so that we can see how the accuracy changes row by row, and get rid of the assumption that we know the gender up to the final line.

Lastly, let's see how well "fourth declension ending in e/i" approximates abstract nouns:

In [19]:
relevant_fourth_declension_feminine_words

{'Roinnse',
 'achainí',
 'achoimre',
 'aice',
 'aicme',
 'aigne',
 'ailse',
 'ainnise',
 'airde',
 'aire',
 'aisce',
 'aiste',
 'aithne',
 'allmhaire',
 'ardchomhairle',
 'bainistí',
 'biaiste',
 'braisle',
 'bréige',
 'brí',
 'buaine',
 'bunáite',
 'caoi',
 'clé',
 'coimirce',
 'coinne',
 'coitinne',
 'comhairle',
 'comhchomhairle',
 'cruinne',
 'cré',
 'cuimhne',
 'cuimse',
 'cuisle',
 'cé',
 'daille',
 'dea-ghuí',
 'deaide',
 'dearóile',
 'deise',
 'deoise',
 'diolúine',
 'dlinse',
 'dlínse',
 'drochshláinte',
 'dé',
 'déanaí',
 'díbhe',
 'díolúine',
 'díthchéille',
 'dúiche',
 'easláine',
 'easláinte',
 'eisimirce',
 'eite',
 'fadtréimhse',
 'faiche',
 'faillí',
 'faire',
 'fairsinge',
 'farraige',
 'feiste',
 'fianaise',
 'fionraí',
 'foiche',
 'foinse',
 'foraithne',
 'fáilte',
 'féile',
 'fírinne',
 'gearrthréimhse',
 'gloine',
 'glúinte',
 'gnáth-thodhchaí',
 'gné',
 'guaille',
 'guí',
 'gé',
 'imirce',
 'imní',
 'inbhuaine',
 'inmhe',
 'inse',
 'iomláine',
 'iontaise',
 'iubha

What words aren't there that you might expect to match an "abstract nouns ending in e/i" rule (e.g. `dlí (m)`)? Have a look at [Wikibooks Irish: Nouns](https://en.wikibooks.org/wiki/Irish/Reference/Nouns) to dig into this in detail, and see important exceptions called out.