neural models
Mika committed May 14, 2021
1 parent c8e9f6e commit 808c9b0
Showing 7 changed files with 190 additions and 242 deletions.
277 changes: 87 additions & 190 deletions LICENSE

Large diffs are not rendered by default.

14 changes: 8 additions & 6 deletions README.md
@@ -1,6 +1,8 @@
# UralicNLP

-[![Build Status](https://travis-ci.com/mikahama/uralicNLP.svg?branch=master)](https://travis-ci.com/mikahama/uralicNLP) [![Updates](https://pyup.io/repos/github/mikahama/uralicNLP/shield.svg)](https://pyup.io/repos/github/mikahama/uralicNLP/) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1143638.svg)](https://doi.org/10.5281/zenodo.1143638) [![Downloads](https://pepy.tech/badge/uralicnlp)](https://pepy.tech/project/uralicnlp) [![DOI](https://joss.theoj.org/papers/10.21105/joss.01345/status.svg)](https://doi.org/10.21105/joss.01345)
+[![Build Status](https://travis-ci.com/mikahama/uralicNLP.svg?branch=master)](https://travis-ci.com/mikahama/uralicNLP) [![Updates](https://pyup.io/repos/github/mikahama/uralicNLP/shield.svg)](https://pyup.io/repos/github/mikahama/uralicNLP/) [![Downloads](https://pepy.tech/badge/uralicnlp)](https://pepy.tech/project/uralicnlp) [![DOI](https://joss.theoj.org/papers/10.21105/joss.01345/status.svg)](https://doi.org/10.21105/joss.01345)

![CC BY NC ND](https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png)

UralicNLP is a natural language processing library targeted mainly for Uralic languages.

@@ -66,7 +68,7 @@ A word form can be lemmatized with UralicNLP. This does not do any disambiguation.
uralicApi.lemmatize("luutapiiri", "fin", word_boundaries=True)
>>['luuta|piiri', 'luu|tapiiri']

-An example of lemmatizing the word *вирев* in Erzya (myv). By default, a **descriptive** analyzer is used. Use *uralicApi.lemmatize("вирев", "myv", descrpitive=False)* for a non-descriptive analyzer. If *word_boundaries* is set to True, the lemmatizer will mark word boundaries with a |. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
+An example of lemmatizing the word *вирев* in Erzya (myv). By default, a **descriptive** analyzer is used. Use *uralicApi.lemmatize("вирев", "myv", descriptive=False)* for a non-descriptive analyzer. If *word_boundaries* is set to True, the lemmatizer will mark word boundaries with a |. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
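The | markers make the segmentations easy to post-process; a minimal sketch using the output format shown above (pure string handling, no transducer required):

```python
# Split each lemmatization result on the '|' word-boundary marker.
# The input list mirrors the word_boundaries=True output shown above.
results = ['luuta|piiri', 'luu|tapiiri']
segmentations = [r.split('|') for r in results]
print(segmentations)  # [['luuta', 'piiri'], ['luu', 'tapiiri']]
```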

### Morphological analysis
Apart from just getting the lemmas, it's also possible to perform a complete morphological analysis.
@@ -75,7 +77,7 @@
uralicApi.analyze("voita", "fin")
>>[['voi+N+Sg+Par', 0.0], ['voi+N+Pl+Par', 0.0], ['voitaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voitaa+V+Act+Imprt+Sg2', 0.0], ['voitaa+V+Act+Ind+Prs+ConNeg', 0.0], ['voittaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voittaa+V+Act+Imprt+Sg2', 0.0], ['voittaa+V+Act+Ind+Prs+ConNeg', 0.0], ['vuo+N+Pl+Par', 0.0]]

-An example of analyzing the word *voita* in Finnish (fin). The default analyzer is **descriptive**. To use a normative analyzer instead, use *uralicApi.analyze("voita", "fin", descrpitive=False)*. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
+An example of analyzing the word *voita* in Finnish (fin). The default analyzer is **descriptive**. To use a normative analyzer instead, use *uralicApi.analyze("voita", "fin", descriptive=False)*. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
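Each analysis is a `lemma+Tag+Tag...` string paired with a weight; a small helper for splitting such strings (a sketch that assumes the lemma itself contains no `+`):

```python
def split_analysis(analysis):
    """Split an FST analysis like 'voi+N+Sg+Par' into (lemma, tag list)."""
    lemma, *tags = analysis.split('+')
    return lemma, tags

print(split_analysis('voitaa+V+Act+Imprt+Sg2'))  # ('voitaa', ['V', 'Act', 'Imprt', 'Sg2'])
```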

### Morphological generation
From a lemma and a morphological analysis, it's possible to generate the desired word form.
@@ -84,7 +86,7 @@
uralicApi.generate("käsi+N+Sg+Par", "fin")
>>[['kättä', 0.0]]

-An example of generating the singular partitive form for the Finnish noun *käsi*. The result is *kättä*. The default generator is a **regular normative** generator. *uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=True)* uses a normative dictionary generator and *uralicApi.generate("käsi+N+Sg+Par", "fin", descrpitive=True)* a descriptive generator. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
+An example of generating the singular partitive form for the Finnish noun *käsi*. The result is *kättä*. The default generator is a **regular normative** generator. *uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=True)* uses a normative dictionary generator and *uralicApi.generate("käsi+N+Sg+Par", "fin", descriptive=True)* a descriptive generator. [You can also use your own transducer](https://github.com/mikahama/uralicNLP/wiki/Models#using-your-own-transducers)
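The analysis strings passed to the generator can be assembled programmatically; a minimal sketch using the `lemma+POS+Number+Case` format shown above (the case inventory here is only an example — each string could then be passed to *uralicApi.generate(query, "fin")*):

```python
# Build generator queries for a small paradigm of the noun 'käsi'.
lemma = "käsi"
cases = ["Nom", "Gen", "Par"]
queries = [f"{lemma}+N+Sg+{case}" for case in cases]
print(queries)  # ['käsi+N+Sg+Nom', 'käsi+N+Sg+Gen', 'käsi+N+Sg+Par']
```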


### Access the HFST transducer
@@ -95,7 +97,7 @@ If you need to get a lower level access to [the HFST transducer object](https://
sms_generator = uralicApi.get_transducer("sms", analyzer=False) #generator
sms_analyzer = uralicApi.get_transducer("sms", analyzer=True) #analyzer

-The same parameters can be used here as for *generate()* and *analyze()* to specify whether you want to use the normative or descriptive analyzers and so on. The defaults are *get_transducer(language, cache=True, analyzer=True, descrpitive=True, dictionary_forms=True)*.
+The same parameters can be used here as for *generate()* and *analyze()* to specify whether you want to use the normative or descriptive analyzers and so on. The defaults are *get_transducer(language, cache=True, analyzer=True, descriptive=True, dictionary_forms=True)*.

### Syntax - Constraint Grammar disambiguation

@@ -122,7 +124,7 @@ The return object is a list of tuples. The first item in each tuple is the word

The *cg.disambiguate* takes in *remove_symbols* as an optional argument. Its default value is *True* which means that it removes the symbols (segments surrounded by @) from the FST output before feeding it to the CG disambiguator. If the value is set to *False*, the FST morphology is fed in to the CG unmodified.
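The idea behind *remove_symbols* can be sketched with a regular expression; this illustrates stripping @-delimited segments (such as flag diacritics) from an FST output string, and is not the library's exact implementation:

```python
import re

def strip_symbols(fst_output):
    """Drop @...@ segments (e.g. flag diacritics) from an FST output string."""
    return re.sub(r'@[^@]*@', '', fst_output)

print(strip_symbols('voi@U.X.A@+N+Sg+Par'))  # voi+N+Sg+Par
```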

-The **default FST analyzer is a descriptive one**, to use a normative analyzer, set the *descriptive* parameter to False *cg.disambiguate(tokens,descrpitive=False)*.
+The **default FST analyzer is a descriptive one**, to use a normative analyzer, set the *descriptive* parameter to False *cg.disambiguate(tokens,descriptive=False)*.

#### Multilingual CG

4 changes: 2 additions & 2 deletions setup.py
@@ -23,7 +23,7 @@
# Versions should comply with PEP440. For a discussion on single-sourcing
# the version across setup.py and the project code, see
# https://packaging.python.org/en/latest/single_source_version.html
-version='1.2.3',
+version='1.3.0',

description='An NLP library for Uralic languages such as Finnish and Sami. Also supports Arabic, Russian etc.',
long_description=long_description,
@@ -37,7 +37,7 @@
author_email='mika.hamalainen@helsinki.fi',

# Choose your license
-license='Apache License, Version 2.0',
+license='Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License',

# See https://pypi.python.org/pypi?%3Aaction=list_classifiers
classifiers=[
26 changes: 18 additions & 8 deletions test_uralicnlp.py
@@ -8,7 +8,7 @@
import re
from mikatools import *

-uralicApi.get_all_forms("kissa", "N", "fin")
+#uralicApi.get_all_forms("kissa", "N", "fin")

#uralicApi.get_transducer("spa", analyzer=True).lookup_optimize()
#print(uralicApi.analyze("hola", "spa"))
@@ -19,32 +19,32 @@
#uralicApi.download("fin")
"""
print(uralicApi.analyze("voita", "fin"))
-print(uralicApi.analyze("voita", "fin", descrpitive=False))
+print(uralicApi.analyze("voita", "fin", descriptive=False))
print(uralicApi.analyze("voita", "fin"))
-print(uralicApi.analyze("voita", "fin", descrpitive=False))
+print(uralicApi.analyze("voita", "fin", descriptive=False))
print(uralicApi.generate("käsi+N+Sg+Par", "fin"))
print(uralicApi.generate("käsi+N+Sg+Par", "fin"))
-print(uralicApi.generate("käsi+N+Sg+Par", "fin", descrpitive=True))
-print(uralicApi.generate("käsi+N+Sg+Par", "fin", descrpitive=True))
+print(uralicApi.generate("käsi+N+Sg+Par", "fin", descriptive=True))
+print(uralicApi.generate("käsi+N+Sg+Par", "fin", descriptive=True))
print(uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=False))
print(uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=False))
print(uralicApi.generate("käsi+N+Sg+Par", "deu"))
#print(uralicApi.dictionary_search("car", "sms"))
-print(uralicApi.lemmatize("voita", "fin", descrpitive=True))
+print(uralicApi.lemmatize("voita", "fin", descriptive=True))
#uralicApi.download("kpv")
"""
"""
cg = Cg3("fin")
-print(cg.disambiguate(["Kissa","voi","nauraa", "."], descrpitive=True))
+print(cg.disambiguate(["Kissa","voi","nauraa", "."], descriptive=True))
cg = Cg3("kpv")
@@ -113,9 +113,19 @@
print word.pos, word.lemma, word.get_attribute("deprel")
print "---"
"""

"""
ud = UD_collection(open_read("test_data/fi_test.conllu"))
sentences = ud.find_sentences(query={"lemma": "olla"}) #finds all sentences with the lemma kissa
for sentence in sentences:
word = sentence.find(query={"lemma": "olla"})
print(word[0].get_attribute("form"))
print(word[0].get_attribute("form"))
"""

print(uralicApi.analyze("hörpähdin", "fin", neural_fallback=True))
print(uralicApi.lemmatize("nirhautan", "fin", neural_fallback=True))
print(uralicApi.generate("hömpötti+N+Sg+Gen", "fin", neural_fallback=True))
print(uralicApi.generate("koirailla+V+Act+Ind+Prs+Sg1", "fin", neural_fallback=True))
print(uralicApi.analyze("juoksen", "fin", neural_fallback=True))
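The *neural_fallback* calls above back off to a neural model when the FST has no answer. The general pattern can be sketched with stand-in functions, no trained models required (the function names and fake outputs here are illustrative only):

```python
def analyze_with_fallback(rule_based, neural, word):
    """Return the rule-based analyses when any exist, else the neural guess."""
    results = rule_based(word)
    return results if results else neural(word)

# Stand-ins: the "FST" knows 'voita' but not the colloquial 'hörpähdin'.
fst = lambda w: [('voi+N+Sg+Par', 0.0)] if w == 'voita' else []
net = lambda w: [(w + '+?+Neural', 0.0)]
print(analyze_with_fallback(fst, net, 'hörpähdin'))  # [('hörpähdin+?+Neural', 0.0)]
```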
12 changes: 6 additions & 6 deletions uralicNLP/cg3.py
@@ -7,22 +7,22 @@
import copy
import re

-def _Cg3__parse_sentence(words, language, morphology_ignore_after=None, descrpitive=True,remove_symbols=True, language_flags=False, words_analysis=None):
+def _Cg3__parse_sentence(words, language, morphology_ignore_after=None, descriptive=True,remove_symbols=True, language_flags=False, words_analysis=None,neural_fallback=False):
sentence = []
if words_analysis is not None and len(words_analysis) < len(words):
words_analysis = words_analysis + [[]]
for i, word in enumerate(words):
existing_analysis = None
if words_analysis is not None:
existing_analysis = words_analysis[i]
-analysis = __hfst_format(word, language, morphology_ignore_after,descrpitive=descrpitive, remove_symbols=remove_symbols, language_flags=language_flags, analysis=existing_analysis)
+analysis = __hfst_format(word, language, morphology_ignore_after,descriptive=descriptive, remove_symbols=remove_symbols, language_flags=language_flags, analysis=existing_analysis,neural_fallback=neural_fallback)
sentence.extend(analysis)
hfst_result_string = "\n".join(sentence)
return hfst_result_string

-def __hfst_format(word, language, morphology_ignore_after=None, descrpitive=True,remove_symbols=True, language_flags=False, analysis=None):
+def __hfst_format(word, language, morphology_ignore_after=None, descriptive=True,remove_symbols=True, language_flags=False, analysis=None,neural_fallback=False):
if analysis is None:
-analysis = uralic_api_analyze(word, language,descrpitive=descrpitive,remove_symbols=remove_symbols, language_flags=language_flags)
+analysis = uralic_api_analyze(word, language,descriptive=descriptive,remove_symbols=remove_symbols, language_flags=language_flags,neural_fallback=neural_fallback)
hfsts = []
if len(analysis) == 0:
hfsts.append(word + "\t" +word + "+?\tinf")
@@ -46,8 +46,8 @@ def __init__(self, language, morphology_languages=None):
self.cg_path = cg_path
self.language = language

-def disambiguate(self, words, morphology_ignore_after=None,descrpitive=True,remove_symbols=True, temp_file=None, language_flags=False, morphologies=None):
-hfst_output = __parse_sentence(words + [""], self.morphology_languages, morphology_ignore_after, descrpitive=descrpitive,remove_symbols=remove_symbols, language_flags=language_flags, words_analysis=morphologies)
+def disambiguate(self, words, morphology_ignore_after=None,descriptive=True,remove_symbols=True, temp_file=None, language_flags=False, morphologies=None, neural_fallback=False):
+hfst_output = __parse_sentence(words + [""], self.morphology_languages, morphology_ignore_after, descriptive=descriptive,remove_symbols=remove_symbols, language_flags=language_flags, words_analysis=morphologies, neural_fallback=neural_fallback)
if temp_file is None:
p1 = Popen(["echo", hfst_output], stdout=PIPE)
else:
33 changes: 33 additions & 0 deletions uralicNLP/neural_fst.py
@@ -0,0 +1,33 @@
try:
from natas.normalize import call_onmt
except ImportError:
call_onmt = None
import os
class NatasNotInstalled(Exception):
pass

class NeuralFST(object):
"""docstring for NeuralFST"""
def __init__(self, model_path):
if call_onmt is None:
raise NatasNotInstalled("Natas is needed for neural models, run:\n\npip install natas")

self.model_path = model_path

def analyze(self, word):
if len(word) == 0:
return []
model_a = os.path.join(self.model_path, "analyzer.pt")
model_l = os.path.join(self.model_path, "lemmatizer.pt")
tags = call_onmt([" ".join(word)] ,model_a,n_best=1)[0][0].replace(" ", "+")
lemma = call_onmt([" ".join(word)], model_l,n_best=1)[0][0].replace(" ", "")
return [(lemma + "+" + tags, 0.0)]

def generate(self, word):
if len(word) ==0:
return []
model_g = os.path.join(self.model_path, "generator.pt")
parts = word.split("+")
parts[0] = " ".join(parts[0])
form = call_onmt([" ".join(parts)] ,model_g, n_best=1)[0][0].replace(" ", "")
return [(form, 0.0)]
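The class above works on character-level sequences; the pre- and post-processing around the seq2seq model can be illustrated without a trained model (*call_onmt* comes from natas and needs .pt model files, so its outputs are faked here):

```python
# How NeuralFST formats strings around the character-level model.
word = "voita"
model_input = " ".join(word)                 # 'v o i t a' (character-level input)
fake_tag_output = "V Act Imprt Sg2"          # hypothetical analyzer model output
fake_lemma_output = "v o i t a a"            # hypothetical lemmatizer model output
tags = fake_tag_output.replace(" ", "+")     # 'V+Act+Imprt+Sg2'
lemma = fake_lemma_output.replace(" ", "")   # 'voitaa'
print([(lemma + "+" + tags, 0.0)])           # [('voitaa+V+Act+Imprt+Sg2', 0.0)]
```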
