prep for further work backref
JD Bothma committed Nov 18, 2014
1 parent 6885852 commit aaec9fb
Showing 3 changed files with 61 additions and 5 deletions.
6 changes: 4 additions & 2 deletions further_work_notes.txt
@@ -35,7 +35,7 @@ Evidence sources
** Bootstrap models, perhaps iterate with the output of one round informing the next
*** iterations might converge towards a maximal score of an automated evaluation, like OntoUSP converges to a mathematical ideal set of edge weights
** perhaps improve accuracy, e.g. cascading error from POS onwards (5.2.1)
*** higher accuracy and stricter filtering at each stage. make these parameters tweakable
*** higher accuracy and stricter filtering at each stage. make these parameters tweakable [tweakable params]

* simply implement more of the available methods (5.2.1)

@@ -66,7 +66,7 @@ Evidence sources
** (5.2.2) självrapporterade förekomsten av hjärtinfarkt (roughly: self-reported occurrence of myocardial infarction)
** (5.3.2) hyponymy expressed in many forms in language

* awareness of intended application of ontology (1.3, 2.3)
* awareness of intended application of ontology (1.3, 2.3) [tweaking params]
** could streamline extraction
** could guide construction
** ontology patterns might help
@@ -139,6 +139,8 @@ Framework extension proposal
*** and what format the values are in
** Speed up iterative development, method/evidence combination (2.3)
* dependency management
* text2onto was hardcoded for English and then hacked to support Spanish, which made it hard to reuse for Swedish
* construction support tools must be decoupled from candidates and pluggable to support experimentation there

Tool improvements
-----------------
34 changes: 34 additions & 0 deletions further_work_outline.txt
@@ -0,0 +1,34 @@
* preprocessing
* evidence sources and candidate extraction
* construction
* evaluation
* ui/tool improvements
* framework extension

1.1 \ref{sec:intro:problem}
1.2 \ref{sec:intro:objective}
1.3 \ref{sec:intro:delimitations}

2.1.9 \ref{subsec:background:machine_reading}
2.2 \ref{sec:lit-rev:immediate}
2.2.2 \ref{sec:background:corp_mgmt}
2.2.3 \ref{sec:lit-rev:preproc}
2.2.4 \ref{subsec:background:info_extraction}
2.2.5 skip
2.2.6 \ref{sec:background:eval}
2.2.8 \ref{subsec:background:ui}
2.3 \ref{sec:background:open-areas}

4.1 \ref{sec:results:design}
4.2 \ref{sec:results:proto:prepr}
4.2.1 \ref{subsec:results:term_cand_ling_filt}
4.5.1 \ref{subsec:results:deps}
4.5.4 \ref{subsec:results:usage}

5.1 \ref{sec:eval:setup}
5.2.1 \ref{sec:results:eval:cands}
5.2.2 \ref{sec:results:eval:cands_manual}
5.3 \ref{sec:analysis}
5.3.1 \ref{subsec:analysis:concepts}
5.3.2 \ref{subsec:analysis:subconcepts}
5.3.3 \ref{subsec:analysis:labeled_rels}
26 changes: 23 additions & 3 deletions report.tex
@@ -76,6 +76,7 @@ \chapter{Introduction}
Some systems attempt to automatically construct ontologies in formats ready for application in semantically-enabled software, while others present the evidence as an aid to human experts who can then build such ontologies with significantly-reduced effort compared to a manual approach.

\section{Problem}
\label{sec:intro:problem}
%ONE-SENT However, there is no Ontology Learning System for learning ontologies from Swedish text.

Most ontology learning research focuses on the English language.
@@ -87,6 +88,7 @@ \section{Problem}
The problem is therefore that ontology learning should be extended to support Swedish text, making use of existing research in natural language processing and information extraction for Swedish where necessary.

\section{Objective}
\label{sec:intro:objective}

The objective of this thesis is to develop a system for ontology learning from Swedish text.
Given the time constraint, only a prototype will be built which will apply a small selection of methods, with the objectives of
@@ -97,6 +99,7 @@ \section{Objective}
\end{enumerate}

\section{Delimitations}
\label{sec:intro:delimitations}

The restriction to a small selection of methods means that only certain kinds of concepts and relations can be extracted and important things might be missed.
This is because certain methods are only suited to particular syntactic or semantic forms.
@@ -271,6 +274,7 @@ \subsection{Ontology Learning from Text}


\subsection{Machine Reading}
\label{subsec:background:machine_reading}

Machine Reading is the automatic, unsupervised \emph{'understanding'} of text where understanding means formation of beliefs supporting some level of reasoning from a textual corpus \citep{EtzioniEtAll06MachineReading}.
Machine Reading is distinguished from Information Retrieval, where this is done in a highly supervised, manual manner, for example where patterns for extracting desired entities are hand-written or manually selected from an extracted list.
@@ -446,6 +450,7 @@ \subsubsection{The SVENSK language processing toolbox for Swedish}
SVENSK is a language processing toolbox for Swedish developed in the late 1990s and 2000~\cite{OlssonGamback00SVENSK}. SVENSK aimed to support research and teaching which depended on Swedish language processing by providing common text processing tools such as taggers and parsers in a general purpose language processing framework. SVENSK was based on the GATE language processing framework.

\subsection{Information Extraction}
\label{subsec:background:info_extraction}

Information Extraction is the task of extracting structured information from a corpus of text.
The units of information generally extracted are terms, concepts, attributes, relations and axioms.
@@ -640,6 +645,7 @@ \subsection{Change Management}
In Ontology Engineering, that means making and tracking the changes to the ontology during construction and ongoing maintenance.

\subsection{User Interaction}
\label{subsec:background:ui}

User interfaces need to support non-ontology-engineer users in selecting and configuring appropriate methods, and then help them access important subsets of a potentially large amount of information extracted.
User interfaces can further help understanding the evidence for parts of the ontology and visualise the ontology's structure.
@@ -954,6 +960,7 @@ \chapter{Results}
The design is shown in figure~\ref{fig:prototype-design} and described in the next section.

\section{Design}
\label{sec:results:design}

\begin{figure}[H]
\includegraphics[width=10cm]{graphics/protege-plugin-components-simple.png}
@@ -994,6 +1001,7 @@ \section{Preprocessing}
The term candidate JAPE rule is described in the next section, while the labeled relation JAPE rule is described in Section~\ref{sec:results:proto:cands}.

\subsection{Linguistic Filter for Term Candidates}
\label{subsec:results:term_cand_ling_filt}

For initial potential terms to be selected by the C-Value method, a linguistic filter of zero or more adjectives, followed by one or more nouns, is used.
This filter was depicted as Adj*Noun+ in \cite{Frantzi98CNCValue} by analogy to regular expressions over parts of speech.
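The Adj*Noun+ filter can be illustrated with a minimal sketch that runs a regular expression over an encoded part-of-speech sequence. This is not the prototype's JAPE rule: the tag names \texttt{ADJ} and \texttt{NOUN}, the function \texttt{term\_candidates} and the sample sentence are all assumptions made up for illustration, and only maximal matches are returned.

```python
import re

# Hypothetical one-letter codes for coarse POS tags; the real tagset is an assumption.
TAG_CODES = {"ADJ": "A", "NOUN": "N"}


def term_candidates(tagged_tokens):
    """Return maximal spans matching Adj*Noun+ in a list of (word, tag) pairs."""
    # Encode the tag sequence as a string so an ordinary regex can scan it.
    # Any tag that is neither adjective nor noun becomes "x".
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    spans = []
    for match in re.finditer(r"A*N+", codes):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        spans.append(" ".join(words))
    return spans


# Made-up, hand-tagged sample sentence (not taken from the evaluation corpus).
sentence = [
    ("akut", "ADJ"), ("hjärtinfarkt", "NOUN"), ("är", "VERB"),
    ("en", "DET"), ("vanlig", "ADJ"), ("sjukdom", "NOUN"),
]
candidates = term_candidates(sentence)
```

Encoding each tag as a single character keeps token positions and string positions aligned, so the match offsets can be used directly to slice the token list.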
@@ -1144,6 +1152,7 @@ \section{Walkthrough}
This section shows how to install and use the prototype while demonstrating it with a sample corpus.

\subsection{Dependencies}
\label{subsec:results:deps}

\begin{itemize}
\item Korp Corpus Pipeline\footnote{http://spraakbanken.gu.se/eng/research/infrastructure/korp/distribution/corpuspipeline}
@@ -1186,6 +1195,8 @@ \subsection{OL Prototype plugin}
The source code for the plugin is available at the project source repository\footnote{https://github.com/downloads/jbothma/ontology-learning-protege}.

\subsection{Usage}
\label{subsec:results:usage}

\begin{figure}[H]
\centering
\includegraphics[width=.69\textwidth]{graphics/1._Open_Protege.png}
@@ -1267,6 +1278,7 @@ \chapter{Results evaluation}
\label{chap:eval}

\section{Evaluation Setup}
\label{sec:eval:setup}

The evaluation was performed on a corpus of articles available at \url{http://lakartidningen.se/}.
All articles available in HTML format categorised under Cardiovascular Disease were selected, as this was one of the largest categories denoting a specific subdomain of medicine available in Swedish in the HTML format.
@@ -1390,6 +1402,7 @@ \subsection{Automatically-extracted candidates}
\end{figure}

\subsection{Manually-extracted candidates}
\label{sec:results:eval:cands_manual}

Ontology elements were extracted manually from a single randomly-chosen article from the corpus used in the evaluation.
The article consisted of 349 words in 22 sentences.
@@ -1453,6 +1466,7 @@ \section{Analysis and discussion}


\subsection{Concept extraction and recommendation}
\label{subsec:analysis:concepts}

Most of the concept candidates extracted do appear to be useful concepts for modelling the domain, but most of them are more general than the domain at hand.
Some way of automatically identifying these more-general concepts would be useful to avoid duplicate definitions in OL for similar domains and to improve interoperability of ontologies.
@@ -1476,6 +1490,7 @@ \subsection{Concept extraction and recommendation}
An example of the C-value ranking working well is the term ``typ 2-diabetes''. Both the full term and ``2-diabetes'' were selected as concepts, but the ranking for ``typ 2-diabetes'' is very high, while the ranking for ``2-diabetes'' is very low. This is because the latter occurs frequently, but only as part of the complete term.
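
The nested-term effect can be read off the C-value definition, reproduced here as a sketch in its commonly stated form. The notation $|T_a|$ for the number of longer candidates containing $a$ is a choice made here, the block assumes the \texttt{amsmath} package, and the prototype may implement a variant of the length factor.

```latex
% |a|  = length of candidate a in words
% f(a) = frequency of a in the corpus
% T_a  = set of longer candidate terms that contain a
\[
\text{C-value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a) & \text{if } a \text{ is not nested,} \\
\log_2 |a| \cdot \Bigl( f(a) - \frac{1}{|T_a|} \sum_{b \in T_a} f(b) \Bigr) & \text{otherwise.}
\end{cases}
\]
% If a occurs n times and only ever inside a single longer term b, then
% T_a = {b} and f(b) = n, so the bracket is n - n = 0 whatever the length factor.
```

Under that assumption a term that never occurs on its own has its frequency fully discounted, which is consistent with the low rank observed for the nested term.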

\subsection{Subconcept relation extraction and recommendation}
\label{subsec:analysis:subconcepts}

The result of 43\% of subconcept candidates being correct is rather disappointing, but the majority of the mistakes can be excluded using simple filters.
Subclass relations to the concept itself can easily be filtered out by comparing the subject and object.
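A minimal sketch of such a self-relation filter follows, assuming candidates are available as (subconcept, superconcept) label pairs. The function name \texttt{drop\_self\_relations}, the case-insensitive comparison and the sample pairs are assumptions for illustration, not the prototype's data model.

```python
def drop_self_relations(candidates):
    """Keep only (sub, super) pairs whose labels differ after normalisation."""
    return [
        (sub, sup)
        for sub, sup in candidates
        # Compare trimmed, lower-cased labels so trivial variants count as equal.
        if sub.strip().lower() != sup.strip().lower()
    ]


# Made-up candidate pairs, including two self-relations.
pairs = [
    ("hjärtinfarkt", "sjukdom"),
    ("sjukdom", "Sjukdom"),
    ("stroke", "stroke "),
]
filtered = drop_self_relations(pairs)
```

Normalising before comparing is a design choice: an exact string comparison would let pairs that differ only in case or whitespace slip through.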
@@ -1506,9 +1521,14 @@ \chapter{Further work}
The third objective of this project is to identify issues for the further development of a tool for ontology learning from Swedish text such as that prototyped as part of this project.
These issues are enumerated here, based on the implementation choices described in chapter~\ref{chap:results} and the evaluation findings in chapter~\ref{chap:eval}.

\begin{itemize}
\item Improve efficiency to scale better with corpus size
\end{itemize}
\section{Preprocessing}

\ref{sec:results:proto:prepr}
\section{Evidence sources and candidate extraction}
\section{Ontology Construction}
\section{Evaluation}
\section{User interface and tool improvements}
\section{Framework extension}

This tool successfully integrated the GATE, Protege and Korp frameworks.
Further research into and practice of ontology engineering depends on being able to integrate, select, and combine the results of a multitude of methods throughout the OL pipeline.
