\documentclass[11pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{multirow}
\usepackage{setspace}
\usepackage{url}
\usepackage{natbib}
\usepackage{amsmath}
\oddsidemargin 0.0in
\evensidemargin 0.0in
\textwidth 6.0in
%\headheight 0.5in
\headheight 0.0in
\topmargin 0.0in
\textheight 9.0in
\bibpunct{(}{)}{,}{a}{}{;}
\title{IceNLP \\
A Natural Language Processing Toolkit for Icelandic\footnote{The following persons have contributed to the development of IceNLP: Hrafn Loftsson, Anton Karl Ingason, Aðalsteinn Tryggvason, Guðmundur Örn Leifsson, Hlynur Sigurþórsson, Ragnar Lárus Sigurðsson, Sverrir Sigmundarson and Robert Östling.} \\ \ \\ \ \\ \ \\
User Guide}
\author{Hrafn Loftsson \\
School of Computer Science \\
Reykjavik University \\
hrafn@ru.is
\and
Anton Karl Ingason \\
School of Humanities \\
University of Iceland \\
antoni@hi.is \\ \ \\ \ \\
}
\begin{document}
\hyphenation{Ice-Tagger Ice-Morphy Reykja-vik}
\label{firstpage}
\date{July 2021}
\maketitle
\newpage
\tableofcontents
\newpage
%\begin{spacing}{1.5}
\section{What is IceNLP?}
\label{sec:intro}
\emph{IceNLP} is an open-source Natural Language Processing (NLP) toolkit for analysing Icelandic text.
The toolkit consists of a tokeniser and a sentence segmentiser, the morphological analyser \emph{IceMorphy}, the linguistic rule-based tagger \emph{IceTagger}, the trigram tagger \emph{TriTagger}, the perceptron tagger \emph{IceStagger}, the shallow parser \emph{IceParser}, the lemmatiser \emph{Lemmald}, and the named entity recogniser \emph{IceNER}.
The system is written as a collection of Java classes.
The tokeniser is used for tokenising a stream of characters into linguistic units and for performing sentence segmentation \citep{pal00}.
\emph{IceMorphy} is mainly used for guessing the tags for unknown words and filling \emph{tag profile gaps} in a dictionary \citep{lof08}.
\emph{IceTagger} is a linguistic rule-based tagger\footnote{As opposed to a data-driven tagger trainable on different languages.} for tagging Icelandic text \citep{lof06,lof08}.
It uses a large part-of-speech (PoS) tagset consisting of about 600 tags (see Section \ref{sec:tagset}).
Evaluation shows that \textit{IceTagger} achieves higher accuracy than the best performing data-driven tagger when tested using the same test corpora and the same ratio of unknown words \citep{lof08,lof09b,lof11,hel04}.
The average tagging accuracy, computed when tagging test corpora derived from the \emph{Icelandic Frequency Dictionary} (\emph{IFD}) corpus \citep{pin91}, is about 92\%\footnote{Tagging accuracy is measured using a corrected version of the IFD corpus \citep{lof09}.}. When using data from \emph{BÍN} (see Section \ref{sec:bin}), the Database of Modern Icelandic Inflections \citep{kri05}, the accuracy increases to about 92.8\%.
\emph{TriTagger} is a re-implementation of the well-known statistical \emph{TnT} tagger \citep{bra00}.
By using \textit{TriTagger} as a word class tagger during initial disambiguation, then using \textit{IceTagger} to disambiguate tags that are consistent with the chosen word class,
and finally using \emph{TriTagger} again to fully disambiguate words, to which \emph{IceTagger} is not able to assign unambiguous tags, an accuracy of about 92.7\% is achieved \citep{lof09b,lof11}. By using \emph{BÍN}, the accuracy further increases to about 93.5\%.
\emph{IceStagger} is a modified version of \emph{Stagger} \citep{ost13}, a tagger based on the Averaged Perceptron algorithm \citep{col02}. By adding specific linguistic features and using \emph{IceMorphy}, an accuracy of about 92.8\% is achieved \citep{lof13}. By using \emph{BÍN}, the accuracy increases to about 93.7\%.
\emph{IceParser} is a shallow parser based on the incremental finite-state parsing technique \citep{mok97}.
It labels both constituent structure and grammatical functions.
Evaluation shows that F-measure for constituents and syntactic functions is 96.7\% and 84.3\%, respectively, when assuming perfectly tagged input \citep{lof07b}.
\emph{Lemmald} is a mixed method lemmatiser for Icelandic.
It combines the advantages of data-driven machine learning with linguistic insights to maximise performance.
Given correct tagging, the system lemmatises Icelandic text with an accuracy of 99.55\% \citep{ant08}.
\emph{IceNER} is a rule-based named entity recogniser for Icelandic.
The system marks persons, companies, locations and events.
Evaluation has shown that \emph{IceNER} achieves an overall F-score of 71.5\% without using a gazette list, and 79.3\% when using a gazette list of only 523 names \citep{try09}.
\section{Installation}
The source of \emph{IceNLP} is available for download/cloning at \url{https://github.com/hrafnl/icenlp}.
Release versions (programs and data without source code) can be downloaded from \url{https://github.com/hrafnl/icenlp/releases}.
The description below assumes installation of a release version for the {\bf Linux} operating system.
The programs and data come in a zip-file named \emph{IceNLP-x.y.z.zip} (where \emph{x.y.z} is the current version number).
Run {\bf unzip} on this zip-file and extract all the files to a directory of your choice.
A main directory, {\bf IceNLP}, will be created with the following subdirectories: {\bf bat}, {\bf dict}, {\bf dist}, {\bf doc}, {\bf lib}, and {\bf ngrams}.
The {\bf bat} directory includes shell scripts (.sh files) for running individual components of the tool. The commands for each tool can be found in a subdirectory of the {\bf bat} directory (see Section \ref{sec:usage}).
The {\bf dict} directory contains various dictionaries related to the individual tools of \emph{IceNLP} as well as shell scripts to extract data from \emph{BÍN}.
The {\bf dist} directory contains the \emph{IceNLPCore.jar} file. This file consists of all the .class files needed to run \emph{IceNLP} along with default dictionaries (``resource files'').
The {\bf doc} directory contains this user guide and a description of the Icelandic tagset.
The {\bf lib} directory contains various .jar files used by \emph{IceNLP}.
The {\bf ngrams} directory contains tools for building ngram models.
\section{The tagset}
\label{sec:tagset}
The taggers in \emph{IceNLP} use the main Icelandic tagset, created during the making of the \emph{IFD} corpus.
Due to the morphological richness of the Icelandic language, the main tagset is large and makes fine distinctions compared to the tagsets of related languages.
The original tagset contains about 700 tags, but the taggers have been developed/trained using a reduced version of the tagset, containing about 600 tags.
Type information for proper nouns (named-entity classification) has been removed and only one tag for numerical constants is used \citep{lof11}.
Each tag in the tagset comprises word class information and morphological features.
%It consists of 662 possible tags: 192 noun tags, 163 pronoun tags, 144 adjective tags, 82 verb tags, 27 numeral tags, 24 article tags, 16 punctuation tags, 9 adverb/preposition tags, 3 conjunction tags and 1 tag for foreign words and unanalysed words, respectively.
Each character in the tag has a particular function.
The first character denotes the word class.
For each word class there is a predefined number of additional characters (at most six) which describe morphological features, like gender, number and case for nouns, degree and declension for adjectives, voice, mood and tense for verbs, etc.
Table \ref{tab:semantics} shows the semantics of the noun tags.
Consider, for example, the tag ``\emph{nken}''. The first letter, ``\emph{n}'', denotes the word class ``\emph{nafnor{\dh}}'' (noun), the second letter, ``\emph{k}'', denotes the gender ``\emph{karlkyn}'' (masculine), the third letter, ``\emph{e}'', denotes the number ``\emph{eintala}'' (singular) and the last letter, ``\emph{n}'', denotes the case ``\emph{nefnifall}'' (nominative case).
\begin{table}
\begin{center}
\begin{tabular}{lll}
\hline
\hline
Char\# & Category/Feature & Symbol -- semantics \\
\hline
1 & Word class & {\bf n}--noun \\
2 & Gender & {\bf k}--masculine, {\bf v}--feminine, {\bf h}--neuter, {\bf x}--unspecified \\
3 & Number & {\bf e}--singular, {\bf f}--plural \\
4 & Case & {\bf n}--nominative, {\bf o}--accusative, {\bf {\th}}--dative, {\bf e}--genitive \\
5 & Article & {\bf g}--with suffixed definite article \\
6 & Proper noun & {\bf s}--proper noun \\
\hline
\hline
\end{tabular}
\caption{The semantics of the noun tags}
\label{tab:semantics}
\end{center}
\end{table}
To give another example, consider the words ``\emph{fallegu hestarnir stukku}'' (the beautiful horses jumped).
The corresponding tag for ``\emph{fallegu}'' is ``\emph{lkenvf}'' denoting adjective, masculine, singular, nominative, weak declension, positive;
the tag for ``\emph{hestarnir}'' is ``\emph{nkfng}'' denoting noun, masculine, plural, nominative with suffixed definite article; and the tag for ``\emph{stukku}'' is ``\emph{sfg3f{\th}}'' denoting verb, indicative mood, active voice, third person, plural and past tense.
Note the agreement in gender, number and case.
A complete description of the Icelandic tagset can be found in the Appendix.
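To make the positional encoding concrete, the following short sketch (illustrative Python, not part of the Java toolkit itself) decodes a noun tag according to the character positions given in the table above:

```python
# Illustrative sketch: decoding an IFD noun tag by character position,
# following the noun-tag table above.
GENDER = {"k": "masculine", "v": "feminine", "h": "neuter", "x": "unspecified"}
NUMBER = {"e": "singular", "f": "plural"}
CASE   = {"n": "nominative", "o": "accusative", "þ": "dative", "e": "genitive"}

def decode_noun_tag(tag):
    """Decode a noun tag such as 'nken' into a list of features."""
    if not tag or tag[0] != "n":
        raise ValueError("not a noun tag: %r" % tag)
    features = ["noun", GENDER[tag[1]], NUMBER[tag[2]], CASE[tag[3]]]
    if len(tag) > 4 and tag[4] == "g":   # 5th char: suffixed definite article
        features.append("definite article")
    if len(tag) > 5 and tag[5] == "s":   # 6th char: proper noun
        features.append("proper noun")
    return features
```

For example, \emph{nken} decodes to noun, masculine, singular, nominative, and \emph{nkfng} additionally carries the suffixed definite article.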
\section{IceMorphy}
\label{sec:iceMorphy}
The unknown word guesser, \emph{IceMorphy}, uses a familiar approach to unknown word guessing, i.e. it performs morphological analysis, compound analysis and ending analysis \citep{mik97,nak03}.
An additional important feature of \emph{IceMorphy} is its handling of \emph{tag profile gaps}.
\begin{enumerate}
\item{\bf Morphological analysis.}
The morphological analyser tries to classify an unknown word as a member of a particular morphological class.
For a given unknown word $w$, a morphological class is guessed depending on the morphological ending of $w$.
Then the stem $r$ of $w$ is extracted and all $k$ possible morphological endings for $r$ are generated resulting in search strings, $s_{i}$ ($i=1,\ldots,k$), such that $s_{i}=r+ending_{i}$.
A dictionary lookup is performed for each $s_{i}$ until a word is found having the same morphological class as originally assumed, or until all candidate endings have been tried without a match.
If the search is successful, a tag is deduced using the assumed word class and the morphological ending of $w$.
\item{\bf Compound analysis.}
This part uses a straightforward method of repeatedly removing prefixes from unknown words and performing a lookup for the remaining part of the word.
If the remaining word part is not found in the dictionary it is sent to the morphological analysis for further processing.
If the lookup or morphological analysis deduces a tag \emph{t} for the remaining word part, the original word (without prefix removal) is given the same tag \emph{t}.
\item{\bf Ending analysis.}
The ending analyser is called if an unknown word can neither be deduced by morphological analysis nor by compound analysis.
This component uses a hand-written dictionary of endings along with an automatically generated one.
The former, which is looked up first, is mainly used to capture common endings for adjectives and verbs, for which numerous tags are possible.
\emph{IceMorphy} assumes that endings are different for capitalized words vs. other words and therefore uses two endings dictionaries, one for proper nouns and another for all other words.
\item{\bf Tag profile gaps.}
A \emph{tag profile gap} arises when a particular word, listed in the dictionary, has some missing tags in its tag profile (set of possible tags).
This, of course, presents problems to a disambiguator since its purpose is to select one single correct tag from all possible ones.
For each noun, adjective, or verb of a particular morphological class, \emph{IceMorphy} generates all possible tags for the given word.
\end{enumerate}
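The morphological analysis step (item 1 above) can be sketched roughly as follows. Note that the ending class and the tiny lexicon below are invented for illustration only; the real morphological classes and dictionaries of \emph{IceMorphy} are far richer:

```python
# Simplified sketch of IceMorphy-style morphological analysis. The ending
# class and the lexicon below are toy examples invented for illustration.
ENDINGS = {"ur": "nken", "i": "nkeþ", "s": "nkee"}   # ending -> deduced tag
TOY_DICT = {"hestur": "nken", "straki": "nkeþ"}      # known word forms

def guess_tag(unknown):
    """Guess a tag for an unknown word: assume a morphological class from
    its ending, generate the sibling forms of its stem, and look them up."""
    # Try endings longest-first so that 'ur' is preferred over a shorter one.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if not unknown.endswith(ending):
            continue
        stem = unknown[: len(unknown) - len(ending)]
        # Generate search strings s_i = stem + ending_i and look each one up.
        for sibling in ENDINGS:
            if stem + sibling in TOY_DICT:
                # Success: deduce the tag from the assumed class and the
                # morphological ending of the unknown word itself.
                return ENDINGS[ending]
    return None
```

Here \emph{hests}, unknown to the toy lexicon, is assigned the genitive tag because the sibling form \emph{hestur} is found in the dictionary.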
\section{IceTagger}
\label{sec:algorithm}
\emph{IceTagger} reads an untagged input file consisting of Icelandic sentences and produces an output file consisting of the words of the sentences augmented with the appropriate PoS tags.
The tagger consists of the following phases:
\begin{enumerate}
\item {\bf Tokenisation.}
The sequence of characters in the input file is split into simple tokens (linguistic units) like words, numbers and punctuation marks. In some cases, sentence segmentation needs to be carried out, i.e. the process of identifying when one sentence ends and another one begins.
\item {\bf Introduction of ambiguity.}
For each sentence to be tagged, the tag profile for each word, both known and unknown words, is introduced.
A word is looked up in a pre-compiled dictionary. If the word exists, i.e. the word is known, the corresponding tag profile for the word is returned.
In the case of a \emph{tag profile gap}, the unknown word guesser, \emph{IceMorphy}, is used for filling in the missing tags.
If the word does not exist in the dictionary, i.e. the word is unknown, \emph{IceMorphy} is used for guessing the possible tags.
At the end of this phase, a given word of a sentence can have multiple tags, i.e. ambiguity has been introduced.
\item {\bf Disambiguation.}
\emph{IceTagger} removes ambiguity by considering the context in which a particular word appears.
To be more specific, the tagger removes illegitimate tags from words based on context.
The tasks below are applied to one sentence at a time:
\begin{enumerate}
\item {\bf Identify idioms and phrasal verbs.}
Idioms, i.e. bigrams and trigrams, which are always tagged unambiguously are kept in a special dictionary.
A special dictionary is also used for recognising phrasal verbs, i.e. verb-particle pairs whose words are adjacent in text.
\item {\bf Apply local elimination rules.}
A sentence to be tagged is scanned from left to right and all tags of each word are checked in sequence.
Depending on the word class (the first letter of the tag) of the focus word, the token is sent to the appropriate disambiguation routine which checks a variety of disambiguation constraints applicable to the particular word class and the surrounding words.
At each step, only tags for the focus word are eliminated.
\item {\bf Apply global heuristics.}
Grammatical function analysis is performed, prepositional phrases are guessed, and the acquired knowledge is used to force feature agreement where appropriate. The heuristics are a collection of functions that guess the syntactic structure of the sentence and use it as an aid in the disambiguation process.
Additionally, specific heuristics are used to choose between supine and past participle verb forms and between infinitive and active verb forms, and to ensure agreement between reflexive pronouns and their antecedents.
Finally, the default heuristic is simply to choose the most frequent tag for a given word.
\end{enumerate}
\end{enumerate}
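As an illustration of what a local elimination rule looks like, the sketch below applies an invented constraint (illustrative Python; the actual disambiguation constraints of \emph{IceTagger} are hand-written in Java and far more numerous): after a preposition unambiguously tagged \emph{aþ} (governing dative), noun tags whose case slot is not dative are eliminated from the following word.

```python
# Toy local elimination rule (invented for illustration): after a preposition
# unambiguously tagged 'aþ' (governing dative), eliminate noun tags on the
# next word whose case slot (4th character) is not dative ('þ').
def eliminate(sentence):
    """sentence: list of (word, candidate_tags) pairs; prunes tags in place."""
    for i in range(1, len(sentence)):
        _, prev_tags = sentence[i - 1]
        word, tags = sentence[i]
        if prev_tags == ["aþ"]:
            kept = [t for t in tags
                    if not (t.startswith("n") and len(t) > 3 and t[3] != "þ")]
            if kept:  # never eliminate every tag of a word
                sentence[i] = (word, kept)
    return sentence
```

Only tags of the focus word are eliminated, mirroring the left-to-right scan described above.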
\section{TriTagger}
\emph{TriTagger} is a statistical tagger based on a Hidden Markov Model (HMM).
The tagger is data-driven, i.e. it learns its language model from a tagged corpus.
The main advantage of data-driven taggers is that they are language independent and no (or limited) human effort is needed for derivation of the model.
The algorithm used by the tagger is as follows (consult \citep{bra00} for full details):
\begin{enumerate}
\item {\bf Tokenisation.}
\emph{TriTagger} uses the tokenisation method described in section \ref{sec:algorithm}.
\item {\bf Introduction of ambiguity.}
Known words are handled in the manner described in section \ref{sec:algorithm}.
Since \emph{TriTagger} is language independent, it has no knowledge of Icelandic morphology.
Suffix analysis is, therefore, the default method for guessing possible tags for unknown words.
On the other hand, since \emph{IceMorphy} already exists, it can be called from within \emph{TriTagger} (see section \ref{sec:tritagger_usage}).
In that case, \emph{TriTagger} will use tags provided by \emph{IceMorphy} if \emph{IceMorphy} can use morphological analysis (as opposed to ending analysis or default handling) to guess the tags for an unknown word.
For other unknown words, suffix analysis is carried out.
\item {\bf Disambiguation.}
The states of the HMM represent pairs of tags and the model emits words each time it leaves a state. A trigram tagger finds an assignment of PoS tags to words by optimising the product of lexical probabilities and contextual probabilities.
Lexical probability is the probability of observing word \emph{i} given PoS \emph{j} ($p(w_{i}|t_{j})$) and contextual probability is the probability of observing PoS \emph{i} given \emph{k} previous PoS ($p(t_{i}|t_{i-1},t_{i-2}, \ldots ,t_{i-k})$; $k=2$ for a trigram model).
A sentence is tagged by assigning it the tag sequence which receives the highest probability by the model.
\end{enumerate}
The probabilities of the model are estimated from a training corpus using maximum likelihood estimation.
Thus, before \emph{TriTagger} can be used, it needs to be trained on a tagged corpus.
A pre-trained model named \emph{otb}, derived from the \emph{IFD} corpus, can be found in the {\bf ngrams/models} directory.
Training of the tagger is described in section \ref{sec:train}.
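The scoring of a candidate tag sequence can be sketched as follows (illustrative Python with invented toy probabilities; a real TnT-style tagger estimates the probabilities from a corpus and decodes with the Viterbi algorithm rather than by exhaustive search):

```python
import math
from itertools import product

# Sketch of trigram (HMM) scoring: the score of a candidate tag sequence is
# the product of contextual p(t_i | t_{i-2}, t_{i-1}) and lexical p(w_i | t_i)
# probabilities, computed here in log space. Unseen events get a tiny
# placeholder probability instead of proper smoothing.
def score(words, tags, lexical, contextual):
    logp = 0.0
    for i, (w, t) in enumerate(zip(words, tags)):
        t1 = tags[i - 1] if i >= 1 else "<s>"
        t2 = tags[i - 2] if i >= 2 else "<s>"
        logp += math.log(contextual.get((t2, t1, t), 1e-10))
        logp += math.log(lexical.get((w, t), 1e-10))
    return logp

def best_sequence(words, candidates, lexical, contextual):
    """Exhaustive search over candidate tag assignments (for illustration
    only; TnT-style taggers use the Viterbi algorithm instead)."""
    return max(product(*candidates),
               key=lambda tags: score(words, tags, lexical, contextual))
```

With toy numbers, the word \emph{við} receives the tag with the highest combined lexical and contextual probability.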
\section{IceStagger}
\emph{IceStagger} \citep{lof13} is a modified version of the Stockholm Tagger (Stagger) \citep{ost13}, an open-source implementation of the Averaged Perceptron tagger by \cite{col02}.
The Averaged Perceptron algorithm uses a feature-rich model that can be trained efficiently.
Features are modeled using \emph{feature functions} of the form
$\phi(h_i,t_i)$ for a history $h_i$ and a tag $t_i$.
The history $h_i$ is a complex object modeling different aspects of the sequence
being tagged. It may contain previously assigned tags in the sequence to be
annotated, as well as other contextual features such as the form of the
current word, or whether the current sentence ends with a question mark.
Intuitively, the job of the training algorithm is to find out which feature
functions are good indicators that a certain tag $t_i$ is associated with a
certain history $h_i$.
A model consists of feature functions $\phi_s$, each paired with a
\emph{feature weight} $\alpha_s$ which is to be estimated during training.
The scoring function is defined over entire sequences, which in a PoS
tagging task typically means sentences. For a sequence of words $w$ of length
$n$ in a model with $d$ feature functions, the scoring function is defined as:
$$ \mathit{score}(w,t) = \sum_{i=1}^n \sum_{s=1}^d \alpha_s\phi_s(h_i,t_i) $$
Training the model is done in an error-driven fashion: tagging each sequence
in the training data with the current model, and adding to the feature weights
the difference between the corresponding feature function for the correct
tagging, and the model's tagging.
During tagging, the highest scoring sequence of tags is computed:
$$ \bar{t} = \operatorname{arg\,max}_t \mathit{score}(w,t) $$
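The error-driven training loop described above can be sketched as follows (illustrative Python with two invented feature templates; \emph{IceStagger}'s real feature set is much richer, and weight averaging is omitted here for brevity):

```python
from collections import defaultdict

# Minimal perceptron-tagger sketch. Feature functions phi_s are represented
# as (name, ...) tuples extracted from the history; feature weights alpha_s
# live in a defaultdict. Averaging of the weights is omitted for brevity.
def features(word, prev_tag, tag):
    return [("word+tag", word, tag), ("prevtag+tag", prev_tag, tag)]

def predict(words, tagset, weights):
    tags = []
    for i, w in enumerate(words):
        prev = tags[i - 1] if i else "<s>"
        tags.append(max(tagset,
                        key=lambda t: sum(weights[f] for f in features(w, prev, t))))
    return tags

def train(sentences, tagset, epochs=5):
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in sentences:
            pred = predict(words, tagset, weights)
            for i, w in enumerate(words):
                if pred[i] != gold[i]:
                    # Add the feature difference: +gold features, -predicted.
                    prev_g = gold[i - 1] if i else "<s>"
                    for f in features(w, prev_g, gold[i]):
                        weights[f] += 1.0
                    prev_p = pred[i - 1] if i else "<s>"
                    for f in features(w, prev_p, pred[i]):
                        weights[f] -= 1.0
    return weights
```

After a few passes over the training data, the weights separate the correct tagging from the model's earlier mistakes.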
\section{IceParser}
\emph{IceParser} is an incremental finite-state parser.
The parser comprises a sequence of finite-state transducers, each of which uses a collection of regular expressions to specify which syntactic patterns are to be recognised.
The purpose of each transducer is to add syntactic information into the recognised substrings of the input text.
\emph{IceParser} is designed to produce annotations according to an annotation scheme described in \citep{lof06c}.
The parser consists of two main components: a phrase structure module and a syntactic functions module.
The purpose of the phrase structure module is to add brackets and labels to input sentences to indicate phrase structure.
The output of one transducer serves as the input to the following transducers in the sequence.
The syntactic annotation is performed in a bottom-up fashion, i.e. deepest constituents are analysed first.
Both simple phrase structures and complex structures are recognised.
Since the parser is based on finite-state machines, a phrase structure of a given type does not contain a nested structure of the same type.
Complex structures contain other structures, whereas simple structures do not.
Two labels are attached to each marked constituent: the first one denotes the beginning of the constituent, the second one denotes the end (e.g. [NP \ldots NP]).
The main labels are \textbf{AdvP}, \textbf{AP}, \textbf{NP}, \textbf{PP} and \textbf{VP} -- the standard labels used for syntactic annotation (denoting adverb, adjective, noun, prepositional and verb phrase, respectively).
Additionally, the labels \textbf{CP}, \textbf{SCP}, \textbf{InjP}, \textbf{APs}, \textbf{NPs} and \textbf{MWE} are used for marking coordinating conjunctions, subordinating conjunctions, interjections, a sequence of adjective phrases, a sequence of noun phrases, and multiword expressions, respectively.
The purpose of the syntactic functions module is to add functional tags to denote grammatical functions.
The input to the first transducer in this module is the output of the last transducer in the phrase structure module, i.e. it is assumed that the syntactic functions module receives text that has been annotated with constituent structure.
As in the phrase structure module, the output of one transducer serves as the input to the following transducers in the sequence.
Four different types of syntactic functions are annotated: genitive qualifiers, subjects, objects/complements and temporal expressions.
Curly brackets are used for denoting the beginning and the end of a syntactic function, and special function tags are used for labels (*QUAL, *SUBJ, *OBJ/*OBJAP/*OBJNOM/*IOBJ/*COMP, *TIMEX).
Please refer to \citep{lof06c} for a thorough description of the annotation scheme used.
In total, \emph{IceParser} consists of about 25 finite-state transducers.
The parser is implemented in Java using the lexical analyser generator tool JFlex (\url{http://jflex.de/}).
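One stage of such a transducer cascade can be sketched with a single regular expression (illustrative Python; the patterns below are invented simplifications of the JFlex rules, handling only a noun optionally preceded by one adjective):

```python
import re

# Toy sketch of one "transducer" in an IceParser-style cascade: a regular
# expression that brackets a (possibly adjective-preceded) noun as an NP in
# token/tag text. The patterns are invented simplifications.
WORD = r"[^\s\[\]]+"
NOUN = rf"{WORD} n\w+"     # token followed by a noun tag (starts with 'n')
ADJ  = rf"{WORD} l\w+"     # token followed by an adjective tag ('l')
NP   = re.compile(rf"((?:{ADJ} )?{NOUN})")

def mark_np(tagged):
    # The output of this stage would feed the next transducer in the cascade.
    return NP.sub(r"[NP \1 NP]", tagged)
```

Applied to \emph{fallegu lkenvf hestarnir nkfng stukku sfg3fþ}, the adjective and noun are bracketed as one NP while the verb is left for a later VP stage.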
\section{Lemmald}
\emph{Lemmald} is a mixed method lemmatiser for Icelandic.
It achieves good performance by relying on \emph{IceTagger} for tagging and the \emph{IFD} corpus for training.
\emph{Lemmald} combines the advantages of data-driven machine learning with linguistic insights to maximise performance.
To achieve this, it makes use of a novel approach: Hierarchy of Linguistic Identities (HOLI), which involves organizing features and feature structures for the machine learning based on linguistic knowledge \citep{ant08}.
Accuracy of the lemmatisation is further improved using an add-on which connects to \emph{BÍN}.
Given correct tagging, the system lemmatises Icelandic text with an accuracy of 99.55\%.
\section{IceNER}
\emph{IceNER} is a named entity recogniser for Icelandic, based on linguistic rules.
The system marks persons, companies, locations and events.
Evaluation has shown that \emph{IceNER} achieves an overall F-score of 71.5\% without using a gazette list, and 79.3\% when using a gazette list of only 523 names \citep{try09}.
The system reads the text several times, applying the strictest rules first and then more relaxed rules.
\emph{IceNER} is built on two subsystems.
The first, called NameScanner, uses regular expressions to create lists of named entities based on endings such as ``-son'' and ``-dóttir'', and abbreviations like ``hf'' and ``ehf''.
It also generates lists of words that can be of significance, such as professional titles, words that imply a location, a company or a person, etc.
The second subsystem, NameFinder, reads these lists, and breaks up combinations of words if a name is made of more than a single word. If, for example, the name ``Ingibjörg Sólrún Gísladóttir'' appears in the name list, then entries for ``Ingibjörg Sólrún'', ``Ingibjörg'', ``Sólrún'', ``Gísladóttir'' and ``Sólrún Gísladóttir'' will also be added.
The NameFinder will also read the text itself, after it has been run through \emph{IceTagger}.
The NameFinder then uses the name lists, together with rules based on the context in which entities appear, to categorise the entities.
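The name-expansion step performed by NameFinder amounts to enumerating every contiguous sub-span of a multi-word name, which can be sketched as follows (illustrative Python, not the actual Java code):

```python
# Sketch of NameFinder-style name expansion: every contiguous sub-span of a
# multi-word name is added, so that partial mentions can be matched later.
def expand(name):
    parts = name.split()
    spans = set()
    for i in range(len(parts)):
        for j in range(i + 1, len(parts) + 1):
            spans.add(" ".join(parts[i:j]))
    return spans
```

For ``Ingibjörg Sólrún Gísladóttir'' this yields exactly the six entries listed in the example above.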
\section{BÍN}
\label{sec:bin}
\emph{BÍN} (``Beygingarlýsing íslensks nútímamáls'') is a comprehensive full form database of modern Icelandic inflections \citep{kri05}, developed at the \emph{Árni Magnússon Institute for Icelandic Studies}. BÍN contains about 280,000 paradigms, with over 5.8 million inflectional forms for common nouns, proper nouns, adjectives, verbs, and adverbs.
Due to licensing issues, \emph{BÍN} cannot be distributed with \emph{IceNLP}. However, \emph{IceNLP} contains several shell scripts to extract data from \emph{BÍN} for the purpose of using it in its taggers. As stated in Section \ref{sec:intro}, the accuracy of the taggers increases considerably when extending their dictionaries with data from \emph{BÍN}.
The shell scripts rely on a database dump of \emph{BÍN}, which is available for download from \url{bin.arnastofnun.is}. The dump file has the name \emph{SHsnid.csv}.
Copy this file into the {\bf dict/BIN} directory. Then run the {\bf extractBinData.sh} script, which will generate dictionaries with data from \emph{BÍN} for the three taggers: \emph{IceTagger}, \emph{TriTagger}, and \emph{IceStagger}.
To run a tagger with an extended dictionary, please refer to Section \ref{sec:usage}.
\section{File format}
The \emph{IceNLP} toolkit uses \textbf{UTF8} character encoding for all files.
It is thus assumed that dictionaries and input files are encoded in UTF8 format.
Moreover, output files, generated by the tool, will be encoded in UTF8.
\subsection{Tagging}
\subsubsection{Input file}
The input file to be tagged can have one of four formats:
\begin{enumerate}
\item {\bf One token/tag pair per line} (only used by \emph{IceStagger}). An empty line (the newline character) is required between sentences.
\item {\bf One token per line}. An empty line (the newline character) is required between sentences.
\item {\bf One sentence per line}.
\item {\bf Other format}. This entails that a sentence can span more than one line, or that there can be more than one sentence per line in the input file.
\end{enumerate}
\subsubsection{Output file}
The taggers can return output in either of two formats:
\begin{enumerate}
\item {\bf One token/tag per line} (or one token/tag/lemma per line).
The token appears first in each line, followed by the tag(s) selected by the tagger (and the lemma if lemmatisation is requested; see Section \ref{sec:icetagger_usage}).
If the token is an unknown word the string \emph{<UNKNOWN>} appears after the tag.
There is some additional output possible in this format, which we will discuss in Section \ref{sec:icetagger_usage}.
Here is an example of this output format:
\begin{verbatim}
ég fp1en
opnaði sfg1eþ
dyrnar nvfog
, ,
steig sfg1eþ
inn aa
og c
sparkaði sfg1eþ
hvítum lkeþsf
brennivínspoka nkeþ <UNKNOWN>
með aþ
sunddóti nheþ <UNKNOWN>
til ae
hliðar nvee
. .
\end{verbatim}
\item {\bf One sentence per line. }
Each line consists of a sentence in which each token is followed by the tag (and possibly the lemma), selected by the tagger.
Here is the example above in this format:
\begin{verbatim}
ég fp1en opnaði sfg1eþ dyrnar nvfog , , steig sfg1eþ inn aa og c sparkaði sfg1eþ
hvítum lkeþsf brennivínspoka nkeþ með aþ sunddóti nheþ til ae hliðar nvee . .
\end{verbatim}
\end{enumerate}
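The one-token/tag-per-line format above is straightforward to consume programmatically; the reader below is an illustrative Python sketch (the format is as documented, the reader itself is not part of \emph{IceNLP}):

```python
# Sketch of reading the one-token/tag-per-line output format shown above.
def read_tagged(lines):
    """Yield (token, tag, is_unknown) triples from tagger output lines;
    an empty line separates sentences and yields None as a separator."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            yield None            # sentence boundary
            continue
        fields = line.split()     # token, tag, optional <UNKNOWN> marker
        token, tag = fields[0], fields[1]
        yield token, tag, fields[-1] == "<UNKNOWN>"
```

Tokens flagged \emph{<UNKNOWN>} in the sample output, such as \emph{brennivínspoka}, are reported with the unknown-word flag set.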
\subsubsection{Dictionaries}
\label{sec:dict}
The {\bf dict} directory contains a copy of the default dictionaries and wordlists that are part of the \emph{IceNLPCore.jar} file. The files in the {\bf dict} directory can be changed by the user and parameters for individual tools of \emph{IceNLP} can be used to point to these dictionaries in case the user wants to change the default behaviour (see Section \ref{sec:usage}).
The dictionaries used by \emph{IceTagger}, which list words/endings and associated tags, have the following format: \\\\
$w_{1}=t_{11}$\_$t_{12}$\_\ldots\_$t_{1s_{1}}$ \\
$w_{2}=t_{21}$\_$t_{22}$\_\ldots\_$t_{2s_{2}}$ \\
\ldots \\
$w_{n}=t_{n1}$\_$t_{n2}$\_\ldots\_$t_{ns_{n}}$ \\
Here $n$ is the number of words/endings in the dictionary, $w_{i}$ is word/ending number $i$, $t_{ik}$ is the $k^{th}$ frequent tag for word/ending $i$, and $s_{i}$ is the number of tags for word/ending $i$ ($i=1{\ldots}n$).
Note that the above means that the tags for a given word/ending are sorted according to frequency -- the most frequent tag appears first in the list of tags for a given word/ending.
To illustrate, the following is a record from a dictionary for the word ``\emph{við}'' (see the Appendix for explanation of the individual tags): \\
\emph{við=ao\_fp1fn\_aþ\_aa} \\
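Because the tags are sorted by frequency, the most frequent tag for each word can, for example, be extracted from a dictionary in this format with a simple one-liner (a sketch, assuming a dictionary file \emph{dict.txt}):
\begin{verbatim}
# Print each word together with its most frequent tag:
# the word is the part before '=', the first '_'-separated tag follows.
awk -F'=' '{ split($2, tags, "_"); print $1, tags[1] }' dict.txt
\end{verbatim}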
Since \emph{TriTagger} bases its language model on frequencies, word and tag frequencies are needed in its dictionary. Thus, the frequency dictionary used by \emph{TriTagger} has the following format: \\\\
$w_{1}$ $f_{w_{1}}$ $t_{11}$ $f_{t_{11}}$ $t_{12}$ $f_{t_{12}}$ {\ldots} $t_{1s}$ $f_{t_{1s}}$ \\
$w_{2}$ $f_{w_{2}}$ $t_{21}$ $f_{t_{21}}$ $t_{22}$ $f_{t_{22}}$ {\ldots} $t_{2s}$ $f_{t_{2s}}$ \\
\ldots \\
$w_{n}$ $f_{w_{n}}$ $t_{n1}$ $f_{t_{n1}}$ $t_{n2}$ $f_{t_{n2}}$ {\ldots} $t_{ns}$ $f_{t_{ns}}$ \\
To illustrate, the following is a record from a frequency dictionary for the word ``\emph{við}'': \\
\emph{við 5810 ao 3673 fp1fn 1332 aa 507 aþ 298}
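In a well-formed record, the per-tag frequencies sum to the word frequency (in the example above, $3673 + 1332 + 507 + 298 = 5810$). A frequency dictionary can therefore be sanity-checked with a short script (a sketch, assuming a frequency dictionary file \emph{freq.dict}):
\begin{verbatim}
# Report records whose tag frequencies (every second field from field 4)
# do not add up to the word frequency (field 2).
awk '{ s = 0; for (i = 4; i <= NF; i += 2) s += $i;
       if (s != $2) print "mismatch:", $1 }' freq.dict
\end{verbatim}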
\subsection{Parsing}
\label{sec:fileFormatParsing}
\subsubsection{Input file}
The input to the parser consists of POS-tagged sentences.
The tags are assumed to be part of the tagset used in the \emph{IFD} corpus, i.e. the tagset used by \emph{IceTagger}.
From version 1.5.0 of \emph{IceNLP}, the parser also accepts tags that conform to the revised Icelandic tagset, described in the documentation for MIM\_GOLD 20.5 (\url{https://repository.clarin.is/repository/xmlui/handle/20.500.12537/39}).
Furthermore, it is assumed that the input file has one sentence in each line.
Here is an example of the input format:
\begin{verbatim}
ég fp1en opnaði sfg1eþ dyrnar nvfog , pk steig sfg1eþ inn aa og c sparkaði sfg1eþ
hvítum lkeþsf brennivínspoka nkeþ með af sunddóti nheþ til af hliðar nvee . pl
\end{verbatim}
\subsubsection{Output file}
The output of the parser consists of the POS-tagged sentences with added syntactic information.
The parser writes either one sentence per line or one phrase/syntactic function per line.
Here is an example of the latter:
\begin{verbatim}
{*SUBJ> [NP ég fp1en ] }
[VP opnaði sfg1eþ ]
{*OBJ< [NP dyrnar nvfog ] }
, pk
[VP steig sfg1eþ ]
[AdvP inn aa ]
[CP og c ]
[VP sparkaði sfg1eþ ]
{*OBJ< [NP [AP hvítum lkeþsf ] brennivínspoka nkeþ ] }
[PP með af [NP sunddóti nheþ ] ]
[PP til af [NP hliðar nvee ] ]
. pl
\end{verbatim}
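The bracketed output lends itself to simple post-processing with standard tools. For instance, the number of noun phrases in a parsed file can be counted as follows (a sketch, assuming a parsed file \emph{parsed.out}):
\begin{verbatim}
# Count noun phrases by counting occurrences of the [NP opening bracket.
grep -o '\[NP' parsed.out | wc -l
\end{verbatim}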
\section{Usage}
\label{sec:usage}
Java 1.6 runtime (or later) is required to run the programs.
Java is available for free from Oracle, \url{http://java.com}.
In this section, usage of the individual tools on Linux is described.
\subsection{The tokeniser}
\label{sec:tok}
The tokeniser application is used for tokenising input files and converting between different file formats (the tokeniser performs both word tokenisation and sentence segmentation).
To start the application, open a terminal (command prompt), go to the \textbf{bat/tokenizer} directory and type in the following command:\\ \\
%\begin{center}
{\bf ./tokenize.sh} [param] \\ \\
The parameters are:
\begin{itemize}
\item \emph{-i <inpFile>}: The input file to be tokenised. The file has a particular input format which is described by the \emph{-if} parameter.
\item \emph{-o <outFile>}: The output file into which the tokens are written. The desired output format is described by the \emph{-of} parameter.
\item \emph{-if <inputFormat>}: This parameter describes the format of the input file.
The possible values are:
\begin{itemize}
\item \emph{0}: One token/tag per line, with an empty line between sentences.
\item \emph{1}: One token per line, with an empty line between sentences.
\item \emph{2}: One sentence per line.
\item \emph{3}: Some other format.
\end{itemize}
\item \emph{-of <outputFormat>}: This parameter describes the desired output format. The possible values are:
\begin{itemize}
\item \emph{1}: One token per line, with an empty line between sentences.
\item \emph{2}: One sentence per line.
\end{itemize}
\item \emph{-l <filename>}: filename is the name of a lexicon used by the tokeniser.
The purpose of the lexicon is to list the abbreviations and the multiword expressions (MWEs) that the tokeniser is supposed to recognise.
If this parameter is not supplied, the tokeniser uses the default resource file \emph{lexicon.txt} in the \emph{IceNLPCore.jar} file.
\item \emph{-c <count>}: The tokeniser quits after tokenising \emph{<count>} sentences.
\item \emph{-mwe}: Mark MWEs in the output.
\item \emph{-sa}: Split abbreviations. Use this option if each abbreviation is to be split into its individual parts.
\item \emph{-ns}: Not strict tokenisation. This means, for example, that strings like delta\$(4) are not broken apart. If this parameter is not supplied, i.e. strict tokenisation is preferred, then the above string will result in the following tokens: delta \$ ( 4 ).
\end{itemize}
For example, the following command:
\begin{verbatim}
./tokenize.sh -i test.txt -o test.out -if 2 -of 1
\end{verbatim}
runs the tokeniser on the input file \emph{test.txt} and writes to the output file \emph{test.out}.
The format of the input file is one sentence per line, and the desired output format is one token per line.
Furthermore, if the -i parameter is not provided, the tokeniser reads from standard input and writes to standard output. In that case, inputFormat=3 and outputFormat=1. For example, the following Linux command can be used to tokenise the string ``Ég á stóran hund. Sá er a.m.k. 10 kíló.'' (and write the output to the screen):
\begin{verbatim}
echo "Ég á stóran hund. Sá er a.m.k. 10 kíló." | ./tokenize.sh
\end{verbatim}
\subsection{SrxSegmentizer}
The \textit{SrxSegmentizer} splits sentences according to rules defined in an SRX file. Such SRX rules are included in the IceNLP distribution and the \textit{Segment} library is used internally to apply the rules.
Use the command \textbf{srxsegmentizer.sh}. Two parameters can be supplied, an input file
and an output file. If those are omitted, input is read from stdin and output written to
stdout.
\textbf{Example:}
\begin{verbatim}
./srxsegmentizer.sh testinput.txt output.txt
(or, using stdin/stdout)
echo "Þetta er nr. 1 og a.m.k. fínt. Farið e.t.v. þangað." | ./srxsegmentizer.sh
\end{verbatim}
Output:
\begin{verbatim}
Þetta er nr. 1 og a.m.k. fínt.
Farið e.t.v. þangað.
\end{verbatim}
\subsection{IceTagger}
\label{sec:icetagger_usage}
To start \emph{IceTagger}, open a terminal, go to the \textbf{bat/icetagger} directory, and type in the following command:\\ \\
{\bf ./icetagger.sh} [parameters] \\ \\
The parameters can be supplied in two ways:
\begin{itemize}
\item \emph{-p <filename>}: This tells the application to read the parameters from the file \emph{<filename>}.
A default parameter file \emph{paramDefault.txt} can be found in the \textbf{bat/icetagger} directory.
This file has a number of attribute-value pairs whose values can be changed.
The parameters are described below.
In most cases, only the parameters \emph{INPUT\_FILE}, \emph{OUTPUT\_FILE}, \emph{LINE\_FORMAT} and \emph{OUTPUT\_FORMAT} need to be changed.
To fully understand some of the other parameters, consult \citet{lof08}.
\begin{itemize}
%\item \emph{INPUT\_MODE}: \emph{message|file}. \emph{message}: used if \emph{IceTagger} should act as a server accepting messages containing sentences to be tagged. The routing and communication protocol used is based on a publish-subscribe protocol
%This feature is under development and will be described in later releases.
%(see section \ref{sec:messageProtocol}).
%\emph{file}: Used if \emph{IceTagger} should read sentences from a file (see the description of the next parameter).
\item \emph{INPUT\_FILE}: The name of the input file to be tagged. The file has a particular input format which is described by the \emph{LINE\_FORMAT} parameter.
\item \emph{OUTPUT\_FILE}: The name of the output file. The file has a particular output format which is described by the \emph{OUTPUT\_FORMAT} parameter.
\item \emph{FILE\_LIST}: The name of a file containing a list of file names (one per line) to be tagged. For each file name $F$ to be tagged the corresponding tagged output file is generated in the same directory as $F$ with the same name as $F$ but with ``.out'' appended. If this parameter is used then the parameters \emph{INPUT\_FILE} and \emph{OUTPUT\_FILE} are ignored.
\item \emph{LINE\_FORMAT}: The format of the input file, 1=one token per line, 2=one sentence per line, 3=other format.
\item \emph{OUTPUT\_FORMAT}: The desired format of the output file, 1=one token per line, 2=one sentence per line.
\item \emph{SEPARATOR}: \emph{space|underscore}. Used for \emph{OUTPUT\_FORMAT=2}. Specifies the character used as a separator between a word and its tag.
\item \emph{SENTENCE\_START}: \emph{upper|lower}. \emph{upper}: Every sentence starts with an upper case letter. \emph{lower}: Every sentence starts with a lower case letter, except when the first word is a proper noun.
\item \emph{LOG\_FILE}: The name of a log file if one is desired. The log file will list debugging information.
\item \emph{FULL\_DISAMBIGUATION}: \emph{yes|no}. This applies to words which the tagger cannot fully disambiguate. If this value is \emph{yes} the tagger will either select the tag with the highest frequency or call \emph{TriTagger} for full disambiguation (see next parameter). If the value is \emph{no} the tagger will return all the tags that could not be eliminated.
\item \emph{MODEL\_TYPE}: \emph{start|end|startend}. If \emph{start}, an n-gram model (see the \emph{MODEL} parameter) is used for choosing the word class during initial disambiguation, and then \textit{IceTagger} is used to disambiguate tags that are consistent with the chosen word class. If \emph{end}, the n-gram model is only run in the last phase to fully disambiguate words to which \emph{IceTagger} is not able to assign unambiguous tags. If \emph{startend}, the n-gram model is used both at the start and in the last phase.
\item \emph{FULL\_OUTPUT}: \emph{yes|no}. If \emph{yes} the tagger will write subject-verb-object information and information on prepositional phrases to the output file and detailed information for unknown words. If \emph{no} then only unknown words are marked.
\item \emph{BASE\_TAGGING}: \emph{yes|no}. If \emph{yes} the tagger will only assign a single tag to each word based on maximum frequency.
\item \emph{TAG\_MAP\_DICT}: The name of the dictionary used for mapping the tags used internally by \textit{IceTagger} to some other tagset.
\item \emph{LEMMATIZE}: \emph{yes|no}. If \emph{yes} then \textit{IceTagger} outputs the lemma, in addition to the word and its tag. Note that the lemma is only written out if \emph{OUTPUT\_FORMAT}=1.
\item \emph{STRICT}: \emph{yes|no}. Strict tokenisation or not. Used by the tokeniser, see section \ref{sec:tok}.
\item For typical use of \textit{IceTagger}, the user does not need to provide values for the following parameters, because as a default the corresponding files are read directly from the \emph{IceNLPCore.jar} file:
\begin{itemize}
\item \emph{MODEL}: The name of an n-gram model. The n-gram model is only used if the \emph{MODEL\_TYPE} parameter has a value (and if \emph{FULL\_DISAMBIGUATION=yes}). If \emph{MODEL\_TYPE} has no value then \emph{IceTagger} performs full disambiguation by selecting the tag with the highest frequency.
\item \emph{BASE\_DICT}: The name of the base dictionary of words and associated tags. Its format can be seen in section \ref{sec:dict}.
\item \emph{DICT}: The name of the main dictionary of words and associated tags. Its format can be seen in section \ref{sec:dict}.
\item \emph{IDIOMS\_DICT}: The name of the dictionary for idioms or multiword expressions and associated tags.
\item \emph{VERB\_PREP\_DICT}: The name of the dictionary for verb-preposition pairs and associated cases.
\item \emph{VERB\_OBJ\_DICT}: The name of the dictionary for verbs and corresponding cases for their objects.
\item \emph{VERB\_ADVERB\_DICT}: The name of the dictionary for verb-particle (phrasal verb) information.
\item \emph{ENDINGS\_BASE}: The name of the base dictionary listing possible tags for different endings. Used by \textit{IceMorphy}.
\item \emph{ENDINGS\_DICT}: The name of the main dictionary listing possible tags for different endings. Used by \textit{IceMorphy}.
\item \emph{ENDINGS\_PROPER\_DICT}: The name of the main dictionary listing possible tags for different proper name endings. Used by \textit{IceMorphy}.
\item \emph{PREFIXES\_DICT}: The name of the prefixes dictionary. Used by \textit{IceMorphy}.
\item \emph{TAG\_FREQUENCY\_FILE}: The name of the tag frequency file. This file is only used by \textit{IceMorphy} when \emph{BASE\_TAGGING}=\emph{yes}.
\item \emph{TOKEN\_DICT}: The name of the file used by the tokeniser to recognise abbreviations, see section \ref{sec:tok}.
\end{itemize}
\end{itemize}
\item The second possibility is to supply the parameters through the command line, for example by issuing commands like: \\ \\
\textbf{./icetagger.sh} -i <inputFile> -o <outputFile> -d <dictionary> -lf 2 \ldots, etc. \\ \\
The parameters supplied this way correspond to the attributes and values above.
The name of the parameters can be seen by typing: \textbf{./icetagger.sh -help} \\ \\
For running \textit{IceTagger} with all the default settings, issue either of the commands:
\begin{itemize}
\item \textbf{./icetagger.sh} -i <inputfile> -o <outputfile>
\item \textbf{./icetagger.sh} -f <filelist>
\end{itemize}
Here, <filelist> is a name of a file containing a list of files (one per line) to be tagged.
If neither the -i/-o parameters nor the -f parameter are provided, \textit{IceTagger} reads from standard input and writes to standard output. For example, the following Linux command can be used to make \textit{IceTagger} tag the string ``Ég á stóran hund'' (and write the output to the screen):
\begin{verbatim}
echo "Ég á stóran hund" | ./icetagger.sh
\end{verbatim}
\end{itemize}
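To illustrate the first method, a minimal parameter file might contain only the following attribute-value pairs (hypothetical file names; the syntax is assumed to follow \emph{paramDefault.txt}, and all omitted parameters keep their default values):
\begin{verbatim}
INPUT_FILE=myinput.txt
OUTPUT_FILE=myoutput.txt
LINE_FORMAT=2
OUTPUT_FORMAT=2
\end{verbatim}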
For increasing the accuracy of \emph{IceTagger}, the main dictionary of the tagger can be extended with data from \emph{BÍN}. Once the data from \emph{BÍN} has been extracted (see Section \ref{sec:bin}), the parameter file \textbf{paramDefaultBin.txt} can be used for running \emph{IceTagger} with the extended dictionary.
\subsection{TriTagger}
\label{sec:tritagger_usage}
To start \emph{TriTagger}, open a terminal, go to the \textbf{bat/tritagger} directory, and type in the following command:\\ \\
{\bf ./tritagger.sh} [parameters] \\ \\
The parameters can be supplied in two ways:
\begin{itemize}
\item \emph{-p <filename>}: This tells the application to read the parameters from the file \emph{<filename>}. A default parameter file \emph{paramDefault.txt} can be found in the \textbf{bat/tritagger} directory.
This file has a number of attribute-value pairs whose values can be changed:
\begin{itemize}
\item \emph{INPUT\_FILE}: See section \ref{sec:icetagger_usage}.
\item \emph{OUTPUT\_FILE}: See section \ref{sec:icetagger_usage}.
\item \emph{FILE\_LIST}: See section \ref{sec:icetagger_usage}.
\item \emph{LINE\_FORMAT}: See section \ref{sec:icetagger_usage}.
\item \emph{OUTPUT\_FORMAT}: See section \ref{sec:icetagger_usage}.
\item \emph{SENTENCE\_START}: See section \ref{sec:icetagger_usage}.
\item \emph{CASE\_SENSITIVE}: \emph{yes|no}. The default is \emph{no} which means that \emph{TriTagger} does case-insensitive lookup into the main dictionary for the first word of a sentence. If that fails, the tagger tries case-sensitive lookup. If this parameter is set to \emph{yes}, then case-insensitive lookup is not performed.
\item \emph{NGRAM}: \emph{2}=bigrams, \emph{3}=trigrams.
\item For typical use of \textit{TriTagger}, the user does not need to provide values for the following parameters, because as a default the corresponding files are read directly from the \emph{IceNLPCore.jar} file:
\begin{itemize}
\item \emph{MODEL}: The name of the model derived from a training corpus. The model consists of an n-gram file, a lexicon and a file with lambda (smoothing) parameters. This model name should not have any extension. For example, if \emph{MODEL}=otb, then the program will load the files \emph{otb.ngram}, \emph{otb.lex} and \emph{otb.lambda} (see section \ref{sec:train}).
\item \emph{STRICT}: See section \ref{sec:icetagger_usage}.
\item \emph{TOKEN\_DICT}: See section \ref{sec:icetagger_usage}.
\item \emph{ICEMORPHY}: \emph{yes|no}. If \emph{yes} then \emph{TriTagger} uses tags guessed by \emph{IceMorphy} for unknown words that go successfully through the morphological analysis component of \emph{IceMorphy}. Otherwise, suffix handling of unknown words is used.
\item \emph{DICT}: Main dictionary used by \emph{IceMorphy}. See section \ref{sec:icetagger_usage}.
\item \emph{BASE\_DICT}: Base dictionary used by \emph{IceMorphy}. See section \ref{sec:icetagger_usage}.
\item \emph{ENDINGS\_BASE}: See section \ref{sec:icetagger_usage}.
\item \emph{ENDINGS\_DICT}: See section \ref{sec:icetagger_usage}.
\item \emph{ENDINGS\_PROPER\_DICT}: See section \ref{sec:icetagger_usage}.
\item \emph{PREFIXES\_DICT}: See section \ref{sec:icetagger_usage}.
\end{itemize}
\item \emph{BACKUP\_DICT}: The name of a backup dictionary. If lookup into the model dictionary fails then this backup dictionary is used.
\item \emph{IDIOMS\_DICT}: See section \ref{sec:icetagger_usage}.
\end{itemize}
\item The second possibility is to supply the parameters through the command line, for example by issuing commands like: \\ \\
\textbf{./tritagger.sh} -i <inputFile> -o <outputFile> -m <model> -lf 2 \ldots, etc. \\ \\
The parameters supplied this way correspond to the attributes and values above.
The name of the parameters can be seen by typing: \textbf{./tritagger.sh -help} \\ \\
For running \textit{TriTagger} with all the default settings, issue either of the commands:
\begin{itemize}
\item \textbf{./tritagger.sh} -i <inputfile> -o <outputfile>
\item \textbf{./tritagger.sh} -f <filelist>
\end{itemize}
Here, <filelist> is a name of a file containing a list of files (one per line) to be tagged.
If neither the -i/-o parameters nor the -f parameter are provided, \textit{TriTagger} reads from standard input and writes to standard output.
For example, the following Linux command can be used to make \textit{TriTagger} tag the string ``Ég á stóran hund'' (and write the output to the screen):
\begin{verbatim}
echo "Ég á stóran hund" | ./tritagger.sh
\end{verbatim}
\end{itemize}
For increasing the accuracy of \emph{TriTagger}, the main dictionary of the tagger can be extended with data from \emph{BÍN}. Once the data from \emph{BÍN} has been extracted (see Section \ref{sec:bin}), the parameter file \textbf{paramDefaultBin.txt} can be used for running \emph{TriTagger} with the extended dictionary.
As mentioned above, one of the files resulting from the training phase is a lexicon file (with the extension \textit{.lex}), containing the tag profile for each word.
In some cases one might want to extend this file, for example by adding data to it from some other resource than the training corpus, such as \emph{BÍN}.
If one wants \textit{TriTagger} to use only the data derived from the training corpus (and not the data from the additional resource) for suffix handling, then a single line containing the following string can be put into the \textit{.lex} file right after the last entry (word) derived from the training corpus:
\begin{verbatim}
[NOSUFFIXES]
\end{verbatim}
During the loading of the lexicon, \textit{TriTagger} will then not use entries that appear after this specially marked string for suffix handling.
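Appending the marker and the additional entries can be done from the shell, for example as follows (a sketch with hypothetical file names, where \emph{models/corpus.lex} is the trained lexicon and \emph{bin\_extra.lex} holds the additional entries):
\begin{verbatim}
# Entries before the marker remain available for suffix handling;
# everything appended after it is excluded from suffix handling.
echo '[NOSUFFIXES]' >> models/corpus.lex
cat bin_extra.lex >> models/corpus.lex
\end{verbatim}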
\subsubsection{Training}
\label{sec:train}
Before \emph{TriTagger} can be used, it needs to be trained on a tagged corpus.
A pre-trained model (otb), derived from the \emph{IFD} corpus, is part of the \emph{IceNLPCore.jar} file and can also be found in the {\bf ngrams/models} directory.
For illustration, we now describe how to train a new model using any training corpus, for example the small corpus \textbf{ngrams/corpus.txt}.
For training, \emph{Perl}\footnote{http://www.perl.org/.} is needed.
\begin{enumerate}
\item Open a terminal and go to the \textbf{ngrams} directory.
\item Type {\bf bash train corpus.txt corpus -e}, where \emph{bash} is a shell, \emph{train} is the program for training, \emph{corpus.txt} is the training corpus, \emph{corpus} is the name of the output model and \emph{-e} signifies empty lines between sentences in the training corpus.
If all goes well, four files, corpus.ngrams, corpus.lex, corpus.orig.lex and corpus.lambda will be created in the \textbf{ngrams/models} directory.
\item At this point the file corpus.lex (and corpus.orig.lex) is a lexicon derived from the corpus.txt training corpus and can be used directly with \emph{TriTagger} as described in section \ref{sec:tritagger_usage}.
%However, this lexicon has probably numerous missing tags (\emph{tag profile gaps}).
%\emph{IceMorphy} can be used to fill in the gaps by:
%\begin{enumerate}
%\item Open a command prompt (not Cygwin) and go to the \textbf{Ngrams} directory.
%\item Type \textbf{fillDict 01}. \emph{IceMorphy} will generate the dictionary 01TM.filled.dict in the \textbf{Ngrams/models} directory.
%\item Type \textbf{fillDictFreq 01}. This command uses files 01TM.orig.lex and 01TM.filled.dict to generate a new tag filled lexicon, 01TM.lex, in the \textbf{Ngrams/models} directory.
%\end{enumerate}
\end{enumerate}
\subsection{IceStagger}
To start \emph{IceStagger} for tagging text, open a terminal, go to the \textbf{bat/icestagger} directory, and type in the following command:\\ \\
{\bf ./icestagger.sh} [parameters] \\ \\
The (main) parameters are the following (the full description of the possible parameters can be found in the README file in this directory):
\begin{itemize}
\item -lang is: For tagging Icelandic, the value \emph{is} is needed for the \emph{lang} parameter.
\item -modelfile <filename>: \emph{<filename>} is the name of a model generated during training.
\item -plain: For generating plain output, i.e. one token/tag pair per line.
\item -icemorphy <n>: \emph{n} is {\bf 0} (do not use IceMorphy), or {\bf 1} (use IceMorphy for filling \emph{tag profile gaps} and guessing the tag profile for unknown words), or {\bf 2} (only use IceMorphy for unknown words).
\item -tag <filename 1> <filename 2> \ldots <filename n>: Specifies tagging of \emph{n} files. This should be the last argument.
\end{itemize}
There are two possible formats for the input files to be tagged:
\begin{itemize}
\item A file with a \emph{.txt} extension is assumed to contain raw text and will be tokenised by \emph{IceStagger's} tokeniser before tagging. If there is only one file name in the input list, the output is written to standard output; otherwise, each tagging output is written to a separate file.
\item A file with any other extension is assumed to contain a single token/tag pair in each line with an empty line between sentences. The tag in the second column is used for evaluating the tagger's accuracy. In this case, the tagging output is written to standard output.
\end{itemize}
For example, the following command:
\begin{verbatim}
./icestagger.sh -modelfile otb.bin -lang is -plain -icemorphy 1 -tag sentences.txt
\end{verbatim}
uses the training model \emph{otb.bin} to tag the \emph{sentences.txt} file using \emph{IceMorphy}, generating \emph{plain} (one token/tag per line) output.
\subsubsection{Training}
To generate a model from a training corpus, the following (main) parameters can be used:
\begin{itemize}
\item -lang is: For training on Icelandic text.
\item -trainfile <filename>: \emph{<filename>} is the training corpus to be used. The format is assumed to be one token/tag pair in each line with an empty line between sentences.
\item -lexicon <filename>: \emph{<filename>} is a lexicon in which each line has 4 tab-separated fields: <word form, lemma, tag, frequency>. The frequency can be 0. The lexicon is optional.
\item -positers <n>: Train the tagger with at most \emph{n} iterations.
\item -plain: For generating plain output, i.e. one token/tag pair per line.
\item -train: Specifies training mode. This should be the last argument.
\end{itemize}
For example, the following command:
\begin{verbatim}
./icestagger.sh -trainfile otb.plain -modelfile otb.bin -positers 10 -lang is -train
\end{verbatim}
uses the training corpus \emph{otb.plain} to produce the training model \emph{otb.bin} using 10 iterations.
A pre-trained model (otb), derived from the \emph{IFD} corpus, is part of the \emph{IceNLP} distribution and can be found in the {\bf models} directory of the \emph{bat/icestagger} directory.
For increasing the accuracy of \emph{IceStagger}, a lexicon with data from \emph{BÍN} can be provided during training. Once the data from \emph{BÍN} has been extracted (see Section \ref{sec:bin}), the shell script \textbf{trainIceStaggerBin.sh} can be used for training. Tagging can then be carried out using the \textbf{tagIceStaggerBin.sh} shell script.
\subsection{Lemmald}
The lemmatizer can be used as part of \textit{IceTagger} by supplying the \textit{-lem} parameter and specifying output
format \textit{1}. See section \ref{sec:icetagger_usage} on \textit{IceTagger} usage for further information.
An example of such usage is the following:
\begin{verbatim}
echo "Ég á stóran hund" | ./icetagger.sh -of 1 -lem
\end{verbatim}
The same result can be achieved using \textbf{./lemmatize.sh}, and that command also allows lemmatizing input that has already been tagged, for example with a different tagger. The parameters of \textbf{./lemmatize.sh}
are the following:
\begin{itemize}
\item \textit{-i <file>}: The input file. If omitted, the input is read from stdin.
\item \textit{-o <file>}: The output file. If omitted, the output is written to stdout.
\item \textit{-h}: Display help.
\item \textit{-lemmatizeTagged}: Indicates that the input is already tagged. Such input
should have one token per line and each token should consist of a word and its tag.
\end{itemize}
\textbf{Example 1: Lemmatizing a plain text file}
\begin{verbatim}
./lemmatize.sh -i plaintext.txt -o myoutput.txt
(or, using stdin/stdout)
echo "Við erum æðislegar. Við kunnum alla dansana." | ./lemmatize.sh
\end{verbatim}
Reads the plain text file plaintext.txt and writes the result to myoutput.txt. \textit{IceTagger} is used for tagging
before \textit{Lemmald} lemmatizes.
Input:
\begin{verbatim}
Við erum æðislegar. Við kunnum alla dansana.
\end{verbatim}
Output:
\begin{verbatim}
Við ég fp1fn
erum vera sfg1fn
æðislegar æðislegur lvfnsf
. . .
Við ég fp1fn
kunnum kvinna sfg1fþ
alla allur fokfo
dansana dans nkfog
. . .
\end{verbatim}
\textbf{Example 2: Lemmatizing input that is already tagged}
To lemmatize tagged input, with one token per line, each consisting of a word form and a PoS tag, supply the parameter \textit{-lemmatizeTagged}.
The lemma is added between the word form and its tag.
\begin{verbatim}
./lemmatize.sh -i testinput.txt -o output.txt -lemmatizeTagged
(or, using stdin/stdout)
cat testinput.txt | ./lemmatize.sh -lemmatizeTagged
\end{verbatim}
Input:
\begin{verbatim}
Ég fp1en
á sfg1en
stóran lkeosf
hund nkeo
\end{verbatim}
Output:
\begin{verbatim}
Ég ég fp1en
á eiga sfg1en
stóran stór lkeosf
hund hundur nkeo
\end{verbatim}
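The resulting three-column output (word form, lemma, tag) is easy to process further; for example, the lemma column alone can be extracted like this (a sketch, assuming a lemmatised file \emph{tagged.out}):
\begin{verbatim}
# Print only the lemma (the second of the three columns).
awk 'NF >= 3 { print $2 }' tagged.out
\end{verbatim}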
\subsection{IceMorphy}
The morphological analyser, \emph{IceMorphy}, can be used as a stand-alone application.
To start \emph{IceMorphy}, open a terminal, go to the \textbf{bat/icemorphy} directory, and type in the following command: \\ \\
{\bf ./icemorphy.sh} -p <paramFile> \\ \\
The format of the parameter file is similar to the format of the file used by \emph{IceTagger}.
Two default parameter files \emph{paramAnalyze.txt} and \emph{paramFill.txt} can be found in the \textbf{bat/icemorphy} directory.
The former is used for analysing words in a file, the latter for filling \emph{tag profile gaps} in a dictionary:
\begin{itemize}
\item {\bf Analysing}.
In this mode \emph{IceMorphy} accepts an input file consisting of one word in each line.
It looks up each word in the supplied dictionary (see the \emph{DICT} parameter) and fetches the corresponding tags if the word is found or guesses the possible tags if the word is unknown.
Unknown words are marked with a * at the end of each line in the output file.
Additionally, one of the strings <MORPHO>, <COMPOUND> or <ENDING> is printed after the *, signifying which module of \emph{IceMorphy} produced the result (see Sect. \ref{sec:iceMorphy}).
The analyser either returns all tags for each word (sorted by frequency) or only the most frequent tag.
This can be controlled by the \emph{MODE} parameter.
\item {\bf Filling}.
In this mode \emph{IceMorphy} accepts an input file (a dictionary) in the format described in section \ref{sec:dict}. For each word in the input file, the morphological analyzer generates the missing tags, i.e. it does \emph{tag profile gap} filling.
\end{itemize}
The parameters of the <paramFile> are described below:
\begin{itemize}
\item \emph{MODE}: \emph{all|one|fill}. all=analyze words and return all tags, one=analyze words and return the one most frequent tag, fill=fill tag profile gaps in a dictionary.
\item \emph{INPUT\_FILE}: The name of the input file to be either \emph{analysed} or \emph{filled}.
\item \emph{OUTPUT\_FILE}: The name of the output file.
\item \emph{LOG\_FILE}: The name of a log file if one is desired. The log file will list debugging information.
\item \emph{SEPARATOR}: \emph{space|equal}. Specifies the character used as a separator between a word and its tag(s).
\item \emph{TAGSEPARATOR}: \emph{space|underscore}. Specifies the character used as a separator between the tags.
\item For typical use of \textit{IceMorphy}, the user does not need to provide values for the following parameters, because as a default the corresponding files are read directly from the \emph{IceNLPCore.jar} file:
\begin{itemize}
\item \emph{DICT}: The name of the main dictionary of words and associated tags. See section \ref{sec:icetagger_usage}.
\item \emph{BASE\_DICT}: The name of the base dictionary. See section \ref{sec:icetagger_usage}.
\item \emph{ENDINGS\_BASE}: See section \ref{sec:icetagger_usage}.
\item \emph{ENDINGS\_DICT}: See section \ref{sec:icetagger_usage}.
\item \emph{ENDINGS\_PROPER\_DICT}: See section \ref{sec:icetagger_usage}.
\item \emph{PREFIXES\_DICT}: See section \ref{sec:icetagger_usage}.
\item \emph{TAG\_FREQUENCY\_FILE}: See section \ref{sec:icetagger_usage}.
\end{itemize}
\end{itemize}
\subsection{IceParser}
To start the parser, open a terminal, go to the \textbf{bat/iceparser} directory and type in the following command:\\ \\
{\bf ./iceParser.sh} -i <inputFile> -o <outputFile> [optional param] \\ \\
The optional parameters are:
\begin{itemize}
\item \emph{-f}: \emph{IceParser} annotates grammatical functions (as well as constituent structure).
\item \emph{-l}: \emph{IceParser} writes out one phrase/syntactic function in each line. Otherwise, the output is one sentence per line.
\item \emph{-a}: \emph{IceParser} uses feature agreement rather than only relying on word order, when grouping words into noun phrases and annotating subjects of verbs.
\item \emph{-e}: \emph{IceParser} attaches a question mark (?) to the end of labels for NPs and/or subjects to denote possible grammatical errors.
\item \emph{-m}: \emph{IceParser} merges function labels with phrase labels.
\item \emph{-json}: \emph{IceParser} writes the output in json format.
\item \emph{-xml}: \emph{IceParser} writes the output in xml format.
\end{itemize}
Note that \emph{IceParser} assumes that the input file has one sentence per line.
Each line consists of a sequence of word-tag pairs (see \ref{sec:fileFormatParsing}).
A \emph{grammar definition corpus}, a representative collection of about 200 Icelandic sentences \citep{lof06c}, is provided in the \textbf{bat/iceparser} directory.
The name of the file is \emph{200sent\_func.gdc} and it has been hand-annotated with constituent structure and grammatical functions.
The original text is in the file \emph{200sent.txt}.
The following command makes \emph{IceParser} annotate the original file with constituent structure and grammatical functions: \\ \\
{\bf ./iceParser.sh} -i 200sent.txt -o 200sent.out -f -l \\
The hand-annotated file \emph{200sent\_func.gdc} and the parser generated file \emph{200sent.out} can then be compared by using utilities like Unix \emph{diff}.
\emph{IceParser} can, additionally, be made to generate output files corresponding to the result of each of its individual finite-state transducers.
In that case, type in: \\ \\
{\bf ./iceparserOut.sh} -i 200sent.txt -o 200sent.out -p . \\
The third command-line parameter above (\emph{-p}) denotes the path for the output files. The output files are text files with the \emph{.out} extension.
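The comparison against the hand-annotated file can also be scripted. The sketch below uses two tiny stand-in files in place of \emph{200sent\_func.gdc} and \emph{200sent.out} (the phrase annotations shown are illustrative only) and counts the lines on which the parser output diverges from the gold annotation:

```shell
# Scripted comparison of parser output against a gold-standard file.
# gold.txt and parsed.txt are tiny stand-ins for 200sent_func.gdc and
# 200sent.out; the annotations are illustrative, not real parser output.
printf '{*SUBJ [NP Hundurinn nkeng NP] *SUBJ}\n[VP geltir sfg3en VP]\n' > gold.txt
printf '{*SUBJ [NP Hundurinn nkeng NP] *SUBJ}\n[VP gelti sfg3en VP]\n' > parsed.txt

# Count the gold lines on which the parser output disagrees
mismatches=$(diff gold.txt parsed.txt | grep -c '^<')
echo "mismatching lines: $mismatches"
```

Since both files hold one sentence per line, a line-oriented \emph{diff} gives a quick per-sentence error count.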
\subsection{IceNER}
To start \emph{IceNER}, open a terminal, go to the \textbf{bat/iceNER} directory and type in the following command:\\ \\
{\bf ./iceNER.sh} -i <inputFile> -o <outputFile> [optional param] \\ \\
The optional parameters are:
\begin{itemize}
\item \emph{-l <filename>}: \emph{IceNER} uses <filename> as a gazette list (a list which contains pre-categorised entities).
\item \emph{-g}: \emph{IceNER} runs in greedy mode. In this mode, all unmarked named entities that follow the prepositions ``á'' and ``í'' are marked as locations and names with the pattern ``Xxxx Xxxx'' are marked as persons.
\end{itemize}
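The ``Xxxx Xxxx'' pattern rule of greedy mode can be illustrated with a regular expression. The sketch below is only a rough approximation of the heuristic, not \emph{IceNER}'s actual implementation, and uses an ASCII-only sample sentence for simplicity:

```shell
# Rough approximation of the -g pattern rule: an unmarked capitalized
# bigram ("Xxxx Xxxx") is marked as a person.
# This is NOT IceNER's actual code, just an illustration of the pattern.
printf 'Anna Karen talar vid Hannes\n' > sample.txt
persons=$(grep -oE '[A-Z][a-z]+ [A-Z][a-z]+' sample.txt)
echo "PERSON: $persons"
```

Note that a single capitalized word (``Hannes'' above) does not match the bigram pattern, which is why greedy mode trades recall for precision on isolated names.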
\subsection{Dictionaries}
The dictionaries used by the system are located in the \textbf{dict} directory.
The dictionaries which start with the prefix \emph{otb} have been automatically generated from the \emph{IFD} corpus.
For example, the main dictionary, \emph{dict/icetagger/otb.dict}, was generated by extracting all the words from the \emph{IFD} corpus along with all the tags that appeared with each word.
The format of this dictionary is described in section \ref{sec:dict}.
Two base dictionaries are used by the system.
These are \emph{dict/icetagger/baseDict.dict} and \emph{dict/icetagger/baseEndings.dict}.
The former is mainly used for words and associated tags of the closed word classes, e.g. conjunctions, pronouns, prepositions and irregular verbs.
A word is first looked up in this base dictionary before the main dictionary (\emph{DICT}) is searched.
The latter is a hand-compiled list of endings and associated tags.
An ending is first looked up in this list before the endings dictionary supplied by the user (\emph{ENDINGS\_DICT}) is searched.
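The lookup order can be sketched as a two-stage search. The entries and the one-line-per-word format below are hypothetical stand-ins (the real dictionary format is described in section \ref{sec:dict}):

```shell
# Hypothetical dictionary entries: the base dictionary is consulted
# first, and only on a miss is the main dictionary searched.
printf 'og c\n' > base.dict                # closed-class word
printf 'hestur nken nkeo\n' > main.dict    # open-class word with tags

lookup() {
  grep "^$1 " base.dict || grep "^$1 " main.dict
}
lookup og        # hit in the base dictionary
lookup hestur    # miss in base.dict, found in the main dictionary
```

Keeping the closed word classes in a small, hand-checked base dictionary means their (often irregular) tag profiles cannot be overridden by noise in the automatically generated main dictionary.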
\section{Demo application}
\label{sec:demo}
A small demo application is part of this release.
The purpose of the application is to analyse (tag and parse) text specified by the user.
To start the application, open a terminal, go to the \textbf{bat/demo} directory and type in the following command:\\ \\
{\bf ./tagAndParseGUI.sh} [inputFile] \\ \\
The input file is optional.
If no input file is specified, it is assumed that the user will type in the text to be analysed.
For example, the file \emph{test.txt} in the \textbf{bat/demo} directory can be analysed by typing:
\begin{verbatim}
./tagAndParseGUI.sh test.txt
\end{verbatim}
Tagging and parsing can also be tested by running the \textbf{./tagAndParse.sh} command in the \textbf{bat/demo} directory.
In that case, the \emph{test.txt} file is used as the input to the tagger. The output of the tagger is then piped into \textit{IceParser}, which finally produces the file \emph{parse.out} as output.
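The pipe structure of \emph{tagAndParse.sh} can be sketched with stand-in commands; below, \emph{tr} plays the role of the tagger and \emph{sed} that of \emph{IceParser} (the real scripts of course do linguistic work, not case folding):

```shell
# Stand-in illustration of the tag-then-parse pipeline:
# stage one transforms the input, stage two wraps it in a bracketing,
# and the final result lands in parse.out, as with tagAndParse.sh.
printf 'hundurinn geltir\n' > test.txt
tr 'a-z' 'A-Z' < test.txt | sed 's/.*/[S & S]/' > parse.out
cat parse.out
```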
\section{Building from source}
To build \emph{IceNLP} from source, you need the following three tools:
\begin{enumerate}
\item{\bf Java Development Kit (JDK)}.
The JDK includes tools for developing and testing programs written in the Java programming language and running on the Java platform. The JDK is available for free from Oracle.
\item{\bf JFlex}.
JFlex is a lexical analyzer generator (also known as scanner generator) for Java, written in Java.
JFlex is available for free from \url{http://jflex.de}.
\item{\bf Apache Ant}.
Apache Ant is a Java library and command-line tool whose mission is to drive processes described in build files as targets and extension points dependent upon each other.
Ant is available for free from \url{http://ant.apache.org/}.
\end{enumerate}
For example, to build \emph{IceNLPCore}, go to the directory {\bf icenlp/core} and issue this command: {\bf ant}.
\emph{Ant} will then use the instructions given in the \emph{build.xml} file to build each individual component of \emph{IceNLP}.
Note that before building you will need to increase the memory available to \emph{JFlex}: go to the JFlex installation directory
and edit the \emph{jflex} launcher script.
At the bottom of this file, change:
\verb|$JAVA -Xmx128m -jar $JFLEX_HOME/lib/jflex-1.x.y.jar $@|
to
\verb|$JAVA -Xmx2048m -jar $JFLEX_HOME/lib/jflex-1.x.y.jar $@|
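The change can also be applied non-interactively with \emph{sed}. The sketch below operates on a local stand-in copy of the launcher line (the real file lives in the JFlex installation directory):

```shell
# Raise the JFlex heap limit with sed instead of a manual edit.
# "jflex" here is a stand-in copy of the launcher line, not the
# installed script.
printf '$JAVA -Xmx128m -jar $JFLEX_HOME/lib/jflex-1.x.y.jar $@\n' > jflex
sed 's/-Xmx128m/-Xmx2048m/' jflex > jflex.tmp && mv jflex.tmp jflex
cat jflex
```

The temporary-file dance avoids \emph{sed -i}, whose syntax differs between GNU and BSD systems.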
\newpage
\begin{spacing}{1.0}
\addcontentsline{toc}{section}{References}
\bibliographystyle{abbrvnat}
\bibliography{ref}
\end{spacing}
\newpage
\begin{spacing}{1.0}
\appendix
\addcontentsline{toc}{section}{Appendix}
\section{The Icelandic tagset}
\begin{table}[h]
\begin{center}
{\scriptsize
\caption{The Icelandic tagset}
%\begin{longtable}{lll}
\begin{tabular}{lll}
\hline
\hline
Char\# & Category/Feature & Symbol -- semantics \\
\hline
%\endhead
1 & Word class & {\bf n}--noun \\
2 & Gender & {\bf k}--masculine, {\bf v}--feminine, {\bf h}--neuter, {\bf x}--unspecified \\
3 & Number & {\bf e}--singular, {\bf f}--plural \\
4 & Case & {\bf n}--nominative, {\bf o}--accusative, {\bf {\th}}--dative, {\bf e}--genitive \\
5 & Article & {\bf g}--with suffixed definite article \\
6 & Proper noun & {\bf s}--proper name \\
\hline
1 & Word class & {\bf l}--adjective \\
2 & Gender & {\bf k}--masculine, {\bf v}--feminine, {\bf h}--neuter \\
3 & Number & {\bf e}--singular, {\bf f}--plural \\
4 & Case & {\bf n}--nominative, {\bf o}--accusative, {\bf {\th}}--dative, {\bf e}--genitive \\
5 & Declension & {\bf s}--strong declension, {\bf v}--weak declension, {\bf o}--indeclineable \\
6 & Degree & {\bf f}--positive, {\bf m}--comparative, {\bf e}--superlative \\
\hline
1 & Word class & {\bf f}--pronoun \\
2 & Subcategory & {\bf a}--demonstrative, {\bf b}--reflexive, {\bf e}--possessive, {\bf o}--indefinite, \\
& & {\bf p}--personal, {\bf s}--interrogative, {\bf t}--relative \\
3 & Gender/Person & {\bf k}--masculine, {\bf v}--feminine, {\bf h}--neuter/{\bf 1}--$1^{st}$ person, {\bf 2}--$2^{nd}$ person \\
4 & Number & {\bf e}--singular, {\bf f}--plural \\
5 & Case & {\bf n}--nominative, {\bf o}--accusative, {\bf {\th}}--dative, {\bf e}--genitive \\
\hline
1 & Word class & {\bf g}--article \\
2 & Gender & {\bf k}--masculine, {\bf v}--feminine, {\bf h}--neuter \\
3 & Number & {\bf e}--singular, {\bf f}--plural \\
4 & Case & {\bf n}--nominative, {\bf o}--accusative, {\bf {\th}}--dative, {\bf e}--genitive \\
\hline
1 & Word class & {\bf t}--numeral \\
2 & Category & {\bf f}--alpha, {\bf a}--numeric \\
3 & Gender & {\bf k}--masculine, {\bf v}--feminine, {\bf h}--neuter \\
4 & Number & {\bf e}--singular, {\bf f}--plural \\
5 & Case & {\bf n}--nominative, {\bf o}--accusative, {\bf {\th}}--dative, {\bf e}--genitive \\
\hline
%\pagebreak
1 & Word class & {\bf s}--verb (except for past participle) \\
2 & Mood & {\bf n}--infinitive, {\bf b}--imperative, {\bf f}--indicative, {\bf v}--subjunctive, \\