neural models
Mika committed May 14, 2021
1 parent c8e9f6e commit 808c9b0
Showing 7 changed files with 190 additions and 242 deletions.
277 changes: 87 additions & 190 deletions LICENSE

Large diffs are not rendered by default.

14 changes: 8 additions & 6 deletions README.md
@@ -1,6 +1,8 @@
# UralicNLP

-[![Build Status](https://travis-ci.com/mikahama/uralicNLP.svg?branch=master)](https://travis-ci.com/mikahama/uralicNLP) [![Updates](https://pyup.io/repos/github/mikahama/uralicNLP/shield.svg)](https://pyup.io/repos/github/mikahama/uralicNLP/) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1143638.svg)](https://doi.org/10.5281/zenodo.1143638) [![Downloads](https://pepy.tech/badge/uralicnlp)](https://pepy.tech/project/uralicnlp) [![DOI](https://joss.theoj.org/papers/10.21105/joss.01345/status.svg)](https://doi.org/10.21105/joss.01345)
+[![Build Status](https://travis-ci.com/mikahama/uralicNLP.svg?branch=master)](https://travis-ci.com/mikahama/uralicNLP) [![Updates](https://pyup.io/repos/github/mikahama/uralicNLP/shield.svg)](https://pyup.io/repos/github/mikahama/uralicNLP/) [![Downloads](https://pepy.tech/badge/uralicnlp)](https://pepy.tech/project/uralicnlp) [![DOI](https://joss.theoj.org/papers/10.21105/joss.01345/status.svg)](https://doi.org/10.21105/joss.01345)

![CC BY NC ND](https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png)

UralicNLP is a natural language processing library targeted mainly for Uralic languages.

@@ -66,7 +68,7 @@ A word form can be lemmatized with UralicNLP. This does not do any disambiguation.
uralicApi.lemmatize("luutapiiri", "fin", word_boundaries=True)
>>['luuta|piiri', 'luu|tapiiri']

-An example of lemmatizing the word *вирев* in Erzya (myv). By default, a **descriptive** analyzer is used. Use *uralicApi.lemmatize("вирев", "myv", descrpitive=False)* for a non-descriptive analyzer. If *word_boundaries* is set to True, the lemmatizer will mark word boundaries with a |. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
+An example of lemmatizing the word *вирев* in Erzya (myv). By default, a **descriptive** analyzer is used. Use *uralicApi.lemmatize("вирев", "myv", descriptive=False)* for a non-descriptive analyzer. If *word_boundaries* is set to True, the lemmatizer will mark word boundaries with a |. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
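The | markers make the segmentations easy to post-process; a minimal sketch using the output format shown above (pure string handling, no transducer required):

```python
# Split each lemmatization result on the '|' word-boundary marker.
# The input list mirrors the word_boundaries=True output shown above.
results = ['luuta|piiri', 'luu|tapiiri']
segmentations = [r.split('|') for r in results]
print(segmentations)  # [['luuta', 'piiri'], ['luu', 'tapiiri']]
```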

### Morphological analysis
Apart from just getting the lemmas, it's also possible to perform a complete morphological analysis.
@@ -75,7 +77,7 @@
uralicApi.analyze("voita", "fin")
>>[['voi+N+Sg+Par', 0.0], ['voi+N+Pl+Par', 0.0], ['voitaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voitaa+V+Act+Imprt+Sg2', 0.0], ['voitaa+V+Act+Ind+Prs+ConNeg', 0.0], ['voittaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voittaa+V+Act+Imprt+Sg2', 0.0], ['voittaa+V+Act+Ind+Prs+ConNeg', 0.0], ['vuo+N+Pl+Par', 0.0]]

-An example of analyzing the word *voita* in Finnish (fin). The default analyzer is **descriptive**. To use a normative analyzer instead, use *uralicApi.analyze("voita", "fin", descrpitive=False)*. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
+An example of analyzing the word *voita* in Finnish (fin). The default analyzer is **descriptive**. To use a normative analyzer instead, use *uralicApi.analyze("voita", "fin", descriptive=False)*. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
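Each analysis is a `lemma+Tag+Tag...` string paired with a weight; a small helper for splitting such strings (a sketch that assumes the lemma itself contains no `+`):

```python
def split_analysis(analysis):
    """Split an FST analysis like 'voi+N+Sg+Par' into (lemma, tag list)."""
    lemma, *tags = analysis.split('+')
    return lemma, tags

print(split_analysis('voitaa+V+Act+Imprt+Sg2'))  # ('voitaa', ['V', 'Act', 'Imprt', 'Sg2'])
```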

### Morphological generation
From a lemma and a morphological analysis, it's possible to generate the desired word form.
@@ -84,7 +86,7 @@
uralicApi.generate("käsi+N+Sg+Par", "fin")
>>[['kättä', 0.0]]

-An example of generating the singular partitive form for the Finnish noun *käsi*. The result is *kättä*. The default generator is a **regular normative** generator. *uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=True)* uses a normative dictionary generator and *uralicApi.generate("käsi+N+Sg+Par", "fin", descrpitive=True)* a descriptive generator. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
+An example of generating the singular partitive form for the Finnish noun *käsi*. The result is *kättä*. The default generator is a **regular normative** generator. *uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=True)* uses a normative dictionary generator and *uralicApi.generate("käsi+N+Sg+Par", "fin", descriptive=True)* a descriptive generator. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
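The analysis strings passed to the generator can be assembled programmatically; a minimal sketch using the `lemma+POS+Number+Case` format shown above (the case inventory here is only an example — each string could then be passed to *uralicApi.generate(query, "fin")*):

```python
# Build generator queries for a small paradigm of the noun 'käsi'.
lemma = "käsi"
cases = ["Nom", "Gen", "Par"]
queries = [f"{lemma}+N+Sg+{case}" for case in cases]
print(queries)  # ['käsi+N+Sg+Nom', 'käsi+N+Sg+Gen', 'käsi+N+Sg+Par']
```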


### Access the HFST transducer
@@ -95,7 +97,7 @@ If you need to get a lower level access to [the HFST transducer object](https://
sms_generator = uralicApi.get_transducer("sms", analyzer=False) #generator
sms_analyzer = uralicApi.get_transducer("sms", analyzer=True) #analyzer

-The same parameters can be used here as for *generate()* and *analyze()* to specify whether you want to use the normative or descriptive analyzers and so on. The defaults are *get_transducer(language, cache=True, analyzer=True, descrpitive=True, dictionary_forms=True)*.
+The same parameters can be used here as for *generate()* and *analyze()* to specify whether you want to use the normative or descriptive analyzers and so on. The defaults are *get_transducer(language, cache=True, analyzer=True, descriptive=True, dictionary_forms=True)*.

### Syntax - Constraint Grammar disambiguation

@@ -122,7 +124,7 @@ The return object is a list of tuples. The first item in each tuple is the word

The *cg.disambiguate* takes in *remove_symbols* as an optional argument. Its default value is *True* which means that it removes the symbols (segments surrounded by @) from the FST output before feeding it to the CG disambiguator. If the value is set to *False*, the FST morphology is fed in to the CG unmodified.
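The idea behind *remove_symbols* can be sketched with a regular expression; this illustrates stripping @-delimited segments (such as flag diacritics) from an FST output string, and is not the library's exact implementation:

```python
import re

def strip_symbols(fst_output):
    """Drop @...@ segments (e.g. flag diacritics) from an FST output string."""
    return re.sub(r'@[^@]*@', '', fst_output)

print(strip_symbols('voi@U.X.A@+N+Sg+Par'))  # voi+N+Sg+Par
```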

-The **default FST analyzer is a descriptive one**, to use a normative analyzer, set the *descriptive* parameter to False *cg.disambiguate(tokens,descrpitive=False)*.
+The **default FST analyzer is a descriptive one**, to use a normative analyzer, set the *descriptive* parameter to False *cg.disambiguate(tokens,descriptive=False)*.

#### Multilingual CG

4 changes: 2 additions & 2 deletions setup.py
@@ -23,7 +23,7 @@
# Versions should comply with PEP440. For a discussion on single-sourcing
# the version across setup.py and the project code, see
# https://packaging.python.org/en/latest/single_source_version.html
-version='1.2.3',
+version='1.3.0',

description='An NLP library for Uralic languages such as Finnish and Sami. Also supports Arabic, Russian etc.',
long_description=long_description,
@@ -37,7 +37,7 @@
author_email='mika.hamalainen@helsinki.fi',

# Choose your license
-license='Apache License, Version 2.0',
+license='Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License',

# See https://pypi.python.org/pypi?%3Aaction=list_classifiers
classifiers=[
26 changes: 18 additions & 8 deletions test_uralicnlp.py
@@ -8,7 +8,7 @@
import re
from mikatools import *

-uralicApi.get_all_forms("kissa", "N", "fin")
+#uralicApi.get_all_forms("kissa", "N", "fin")

#uralicApi.get_transducer("spa", analyzer=True).lookup_optimize()
#print(uralicApi.analyze("hola", "spa"))
@@ -19,32 +19,32 @@
#uralicApi.download("fin")
"""
print(uralicApi.analyze("voita", "fin"))
-print(uralicApi.analyze("voita", "fin", descrpitive=False))
+print(uralicApi.analyze("voita", "fin", descriptive=False))
print(uralicApi.analyze("voita", "fin"))
-print(uralicApi.analyze("voita", "fin", descrpitive=False))
+print(uralicApi.analyze("voita", "fin", descriptive=False))
print(uralicApi.generate("käsi+N+Sg+Par", "fin"))
print(uralicApi.generate("käsi+N+Sg+Par", "fin"))
-print(uralicApi.generate("käsi+N+Sg+Par", "fin", descrpitive=True))
-print(uralicApi.generate("käsi+N+Sg+Par", "fin", descrpitive=True))
+print(uralicApi.generate("käsi+N+Sg+Par", "fin", descriptive=True))
+print(uralicApi.generate("käsi+N+Sg+Par", "fin", descriptive=True))
print(uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=False))
print(uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=False))
print(uralicApi.generate("käsi+N+Sg+Par", "deu"))
#print(uralicApi.dictionary_search("car", "sms"))
-print(uralicApi.lemmatize("voita", "fin", descrpitive=True))
+print(uralicApi.lemmatize("voita", "fin", descriptive=True))
#uralicApi.download("kpv")
"""
"""
cg = Cg3("fin")
-print(cg.disambiguate(["Kissa","voi","nauraa", "."], descrpitive=True))
+print(cg.disambiguate(["Kissa","voi","nauraa", "."], descriptive=True))
cg = Cg3("kpv")
@@ -113,9 +113,19 @@
print word.pos, word.lemma, word.get_attribute("deprel")
print "---"
"""

"""
ud = UD_collection(open_read("test_data/fi_test.conllu"))
sentences = ud.find_sentences(query={"lemma": "olla"}) #finds all sentences with the lemma kissa
for sentence in sentences:
word = sentence.find(query={"lemma": "olla"})
print(word[0].get_attribute("form"))
print(word[0].get_attribute("form"))
"""

print(uralicApi.analyze("hörpähdin", "fin", neural_fallback=True))
print(uralicApi.lemmatize("nirhautan", "fin", neural_fallback=True))
print(uralicApi.generate("hömpötti+N+Sg+Gen", "fin", neural_fallback=True))
print(uralicApi.generate("koirailla+V+Act+Ind+Prs+Sg1", "fin", neural_fallback=True))
print(uralicApi.analyze("juoksen", "fin", neural_fallback=True))
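The *neural_fallback* calls above back off to a neural model when the FST has no answer. The general pattern can be sketched with stand-in functions, no trained models required (the function names and fake outputs here are illustrative only):

```python
def analyze_with_fallback(rule_based, neural, word):
    """Return the rule-based analyses when any exist, else the neural guess."""
    results = rule_based(word)
    return results if results else neural(word)

# Stand-ins: the "FST" knows 'voita' but not the colloquial 'hörpähdin'.
fst = lambda w: [('voi+N+Sg+Par', 0.0)] if w == 'voita' else []
net = lambda w: [(w + '+?+Neural', 0.0)]
print(analyze_with_fallback(fst, net, 'hörpähdin'))  # [('hörpähdin+?+Neural', 0.0)]
```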
12 changes: 6 additions & 6 deletions uralicNLP/cg3.py
@@ -7,22 +7,22 @@
import copy
import re

-def _Cg3__parse_sentence(words, language, morphology_ignore_after=None, descrpitive=True,remove_symbols=True, language_flags=False, words_analysis=None):
+def _Cg3__parse_sentence(words, language, morphology_ignore_after=None, descriptive=True,remove_symbols=True, language_flags=False, words_analysis=None,neural_fallback=False):
sentence = []
if words_analysis is not None and len(words_analysis) < len(words):
words_analysis = words_analysis + [[]]
for i, word in enumerate(words):
existing_analysis = None
if words_analysis is not None:
existing_analysis = words_analysis[i]
-analysis = __hfst_format(word, language, morphology_ignore_after,descrpitive=descrpitive, remove_symbols=remove_symbols, language_flags=language_flags, analysis=existing_analysis)
+analysis = __hfst_format(word, language, morphology_ignore_after,descriptive=descriptive, remove_symbols=remove_symbols, language_flags=language_flags, analysis=existing_analysis,neural_fallback=neural_fallback)
sentence.extend(analysis)
hfst_result_string = "\n".join(sentence)
return hfst_result_string

-def __hfst_format(word, language, morphology_ignore_after=None, descrpitive=True,remove_symbols=True, language_flags=False, analysis=None):
+def __hfst_format(word, language, morphology_ignore_after=None, descriptive=True,remove_symbols=True, language_flags=False, analysis=None,neural_fallback=False):
if analysis is None:
-analysis = uralic_api_analyze(word, language,descrpitive=descrpitive,remove_symbols=remove_symbols, language_flags=language_flags)
+analysis = uralic_api_analyze(word, language,descriptive=descriptive,remove_symbols=remove_symbols, language_flags=language_flags,neural_fallback=neural_fallback)
hfsts = []
if len(analysis) == 0:
hfsts.append(word + "\t" +word + "+?\tinf")
@@ -46,8 +46,8 @@ def __init__(self, language, morphology_languages=None):
self.cg_path = cg_path
self.language = language

-def disambiguate(self, words, morphology_ignore_after=None,descrpitive=True,remove_symbols=True, temp_file=None, language_flags=False, morphologies=None):
-hfst_output = __parse_sentence(words + [""], self.morphology_languages, morphology_ignore_after, descrpitive=descrpitive,remove_symbols=remove_symbols, language_flags=language_flags, words_analysis=morphologies)
+def disambiguate(self, words, morphology_ignore_after=None,descriptive=True,remove_symbols=True, temp_file=None, language_flags=False, morphologies=None, neural_fallback=False):
+hfst_output = __parse_sentence(words + [""], self.morphology_languages, morphology_ignore_after, descriptive=descriptive,remove_symbols=remove_symbols, language_flags=language_flags, words_analysis=morphologies, neural_fallback=neural_fallback)
if temp_file is None:
p1 = Popen(["echo", hfst_output], stdout=PIPE)
else:
33 changes: 33 additions & 0 deletions uralicNLP/neural_fst.py
@@ -0,0 +1,33 @@
try:
from natas.normalize import call_onmt
except ImportError:
call_onmt = None
import os
class NatasNotInstalled(Exception):
pass

class NeuralFST(object):
"""docstring for NeuralFST"""
def __init__(self, model_path):
if call_onmt is None:
raise NatasNotInstalled("Natas is needed for neural models, run:\n\npip install natas")

self.model_path = model_path

def analyze(self, word):
if len(word) == 0:
return []
model_a = os.path.join(self.model_path, "analyzer.pt")
model_l = os.path.join(self.model_path, "lemmatizer.pt")
tags = call_onmt([" ".join(word)] ,model_a,n_best=1)[0][0].replace(" ", "+")
lemma = call_onmt([" ".join(word)], model_l,n_best=1)[0][0].replace(" ", "")
return [(lemma + "+" + tags, 0.0)]

def generate(self, word):
if len(word) ==0:
return []
model_g = os.path.join(self.model_path, "generator.pt")
parts = word.split("+")
parts[0] = " ".join(parts[0])
form = call_onmt([" ".join(parts)] ,model_g, n_best=1)[0][0].replace(" ", "")
return [(form, 0.0)]
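The class above works on character-level sequences; the pre- and post-processing around the seq2seq model can be illustrated without a trained model (*call_onmt* comes from natas and needs .pt model files, so its outputs are faked here):

```python
# How NeuralFST formats strings around the character-level model.
word = "voita"
model_input = " ".join(word)                 # 'v o i t a' (character-level input)
fake_tag_output = "V Act Imprt Sg2"          # hypothetical analyzer model output
fake_lemma_output = "v o i t a a"            # hypothetical lemmatizer model output
tags = fake_tag_output.replace(" ", "+")     # 'V+Act+Imprt+Sg2'
lemma = fake_lemma_output.replace(" ", "")   # 'voitaa'
print([(lemma + "+" + tags, 0.0)])           # [('voitaa+V+Act+Imprt+Sg2', 0.0)]
```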
