jiemakel/fica

FiCa is a user interface that lets the end user filter and categorize a set of data as quickly as possible, based on contextual information.

Figure 1. The FiCa interface.

FiCa is operated primarily through the keyboard. In the FiCa interface, shown in Figure 1, the entries to process appear in a spreadsheet-like view on the left of the screen. The user moves through these with hotkeys and makes filtering and categorization decisions with quick keypresses. On the right, contextual information is automatically loaded for the row in focus to aid these decisions. In the configuration depicted, the top right shows how the word appears in its original letter context, and the bottom right shows the information on the word in the OED.

Spelling variation in the CEEC (Corpus of Early English Correspondence) is significant. This matters for filtering because many of the 6,800 surface forms are in fact the same word, just spelled differently. To make use of this, FiCa allows words to be grouped hierarchically by algorithmically calculated keys, so that the user can make a decision for a whole group with a single keypress.

For the -er case, we used two main means of calculating such grouping keys. Firstly, we used VARD2 (Baron, Rayson & Archer 2009), a commonly used tool for assisting in the normalization of spelling in Early and Late Modern English texts. We have a VARD2-produced and human-validated normalized version of a subset of our corpus, namely the letters from collections spanning the 16th–18th centuries; however, the normalization is only applied to sufficiently frequent words, so a number of low-frequency types remain non-normalized (Palander-Collin & Hakala 2011). We used the VARD2 mapping between spelling variants and their normalizations to extract normalized grouping keys for use in FiCa. Secondly, as spelling variation in the absence of standardization often arises between phonetically similar forms, we used the Metaphone algorithm (Philips 2000) to calculate phonetic keys of two different granularities for the words.
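The key calculation described above can be sketched roughly as follows. This is a minimal illustration, not FiCa's actual code: `compute_keys`, the `Keys` tuple and the `normalizations` dictionary (standing in for the VARD2 variant-to-normalization mapping) are hypothetical names, and `consonant_skeleton` is a deliberately crude stand-in for a phonetic key, not an implementation of Metaphone. Computing the phonetic key from the original spelling is also an assumption.

```python
from typing import Callable, Dict, NamedTuple


class Keys(NamedTuple):
    phonetic_coarse: str  # short phonetic key (most general grouping level)
    phonetic_fine: str    # longer phonetic key
    normalized: str       # normalized spelling, or the original if none is known
    original: str         # surface form as it appears in the corpus


def consonant_skeleton(word: str) -> str:
    """Crude stand-in for a phonetic key. This is NOT the Metaphone algorithm."""
    letters = [c for c in word.upper() if c.isalpha()]
    out = []
    for i, c in enumerate(letters):
        if i > 0 and c in "AEIOUY":
            continue  # drop vowels except in word-initial position
        if i > 0 and letters[i - 1] == c:
            continue  # collapse immediately doubled letters
        out.append(c)
    return "".join(out)


def compute_keys(
    word: str,
    normalizations: Dict[str, str],
    phonetic: Callable[[str], str] = consonant_skeleton,
    coarse_len: int = 4,
    fine_len: int = 10,
) -> Keys:
    """Return grouping keys for one surface form, from most general to most specific."""
    key = phonetic(word)
    # Low-frequency forms without a known normalization fall back to themselves.
    normalized = normalizations.get(word, word)
    return Keys(key[:coarse_len], key[:fine_len], normalized, word)


# Hypothetical one-entry mapping, for illustration only.
example_keys = compute_keys("rememb=r=", {"rememb=r=": "remember"})
```

A real setup would pass an actual Metaphone implementation as the `phonetic` argument; the two truncation lengths then give the two granularities.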

In FiCa, these keys were then organized into a hierarchy, starting with the more general phonetic key, and progressing through the more fine-grained phonetic key and the normalized form to the individual words. Figure 1 shows an example of what such a grouping looks like in the spreadsheet portion of the interface. Here, the rightmost column is the Metaphone key of at most ten characters, with the four-character Metaphone key to its left. The leftmost columns are first the VARD2-normalized form of the word, and then the original form as it appears in the corpus. The middle columns are the filtering and categorization columns, followed by a frequency column for information. In the interface, groups are shown as uneditable rows, highlighting the common row values that make up the group key.

Here, the highest-level group line at the top shows that the eight lines below it are grouped under the Metaphone4 key of RMMR. Seven of those are further grouped under the exact same Metaphone10 key, while the final one has a longer Metaphone10 key of RMMRNSR (note that distinct hierarchical group rows are not shown where they do not add any information or distinct choices, as here). Inside the Metaphone10 groups, all entries are themselves distinct apart from one instance, where the VARD2-based normalizer has determined that rememb=r= (the paired = signs here encoding a superscript) also normalizes to remember.
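As a rough sketch of how such a display could be derived, the following builds group and entry rows from key tuples ordered from most general to most specific. The function name `build_rows` and the row format are hypothetical. The rule used for suppressing group rows (skip a group with a single member, or one that contains exactly the same entries as its parent) is an assumption about what "does not add any information" means, not FiCa's documented behaviour. The entry `remembre` is invented for illustration.

```python
from itertools import groupby
from typing import List, Optional, Sequence, Tuple

# (Metaphone4, Metaphone10, normalized form, original form)
Entry = Tuple[str, ...]
Row = Tuple[str, int, Tuple[str, ...]]  # (kind, level, key values)


def build_rows(
    entries: Sequence[Entry], level: int = 0, parent_size: Optional[int] = None
) -> List[Row]:
    """Flatten entries into display rows, adding group rows only where informative."""
    if not entries:
        return []
    entries = sorted(entries)
    if level == len(entries[0]) - 1:
        # Last key is the original form: emit the entries themselves.
        return [("entry", level, e) for e in entries]
    rows: List[Row] = []
    for _, members_iter in groupby(entries, key=lambda e: e[level]):
        members = list(members_iter)
        # Assumed rule: a group row is useful only if it has several members
        # and is a strict subset of its parent group.
        if len(members) > 1 and len(members) != parent_size:
            rows.append(("group", level, members[0][: level + 1]))
        rows.extend(build_rows(members, level + 1, len(members)))
    return rows


example = [
    ("RMMR", "RMMR", "remember", "remember"),
    ("RMMR", "RMMR", "remember", "rememb=r="),
    ("RMMR", "RMMR", "remembre", "remembre"),  # hypothetical non-normalized form
    ("RMMR", "RMMRNSR", "remembrancer", "remembrancer"),
]
rows = build_rows(example)
for kind, level, values in rows:
    print("  " * level + kind, values)
```

With this data, group rows appear for the Metaphone4 key, for the shared Metaphone10 key and for the two forms normalized to remember, while remembrancer is listed as a plain entry without redundant single-member group rows above it.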

Using FiCa this way, none of the algorithms are trusted blindly, but on the other hand, similar words are grouped together regardless of their exact spelling. If the words in a higher-level group happen to represent the same lemma, decisions can be made on that level using a single keypress. In this case, the Metaphone10 groups are the level of meaningful distinction, so the right course of action is to filter out the group with the key of RMMR and keep the one line with remembrancer and the RMMRNSR key, as well as encode the -er1 (agentive) category for this word, determined from the CEEC and OED context views. When the algorithms do not find suitable lemma forms for the word, FiCa also allows the user to change the generated form, thus verifying and correcting the output and grouping. This also automatically updates the OED context view to show the proper word for making grounded decisions.
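The group-level decision in this example can be sketched as below. This is only an illustration under assumptions: `decide_group`, the decision labels `"filter"` and `"keep:-er1"`, and the dictionary keyed by original form are hypothetical and do not reflect FiCa's internal data model.

```python
from typing import Dict, Iterable, Tuple

# (Metaphone4, Metaphone10, normalized form, original form)
Entry = Tuple[str, ...]


def decide_group(
    entries: Iterable[Entry],
    group_key: Tuple[str, ...],
    decision: str,
    decisions: Dict[str, str],
) -> Dict[str, str]:
    """Record `decision` for every entry whose leading keys equal `group_key`."""
    n = len(group_key)
    for entry in entries:
        if entry[:n] == group_key:
            decisions[entry[-1]] = decision  # keyed by the original form
    return decisions


entries = [
    ("RMMR", "RMMR", "remember", "remember"),
    ("RMMR", "RMMR", "remember", "rememb=r="),
    ("RMMR", "RMMRNSR", "remembrancer", "remembrancer"),
]
decisions: Dict[str, str] = {}
# One action filters out the whole Metaphone10 group RMMR ...
decide_group(entries, ("RMMR", "RMMR"), "filter", decisions)
# ... and another keeps the RMMRNSR line with the -er1 (agentive) category.
decide_group(entries, ("RMMR", "RMMRNSR"), "keep:-er1", decisions)
```

Because a group key is just a prefix of the key tuple, the same call works at any level of the hierarchy, from the coarsest phonetic key down to a single word.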
