"TypeError: expected str, bytes or os.PathLike object, not NoneType" about Stanford NLP #1808

Closed
scottming opened this Issue Aug 14, 2017 · 3 comments

scottming commented Aug 14, 2017

Environment

System: Mac OS X 10.11.6
Python: 3.6.1
nltk: 3.2.4

Stanford NLP files

$ ls -l
total 1608960
-rwxr-xr-x   1 Scott  staff    2162000 Aug 14 15:29 chinesePCFG.ser.gz
drwxr-xr-x   6 Scott  staff        204 Aug 14 15:01 stanford-chinese-corenlp-2017-06-09-models
-rwxrwxrwx   1 Scott  staff  821613963 Aug 14 14:44 stanford-chinese-corenlp-2017-06-09-models.jar
drwxr-xr-x  22 Scott  staff        748 Aug 14 15:17 stanford-ner-2017-06-09
drwxr-xr-x  33 Scott  staff       1122 Aug 14 15:35 stanford-parser-full-2017-06-09
drwxr-xr-x  20 Scott  staff        680 Aug 14 15:35 stanford-postagger-full-2017-06-09
drwxr-xr-x  15 Scott  staff        510 Aug  6 15:49 stanford-segmenter-2017-06-09
-rw-r--r--   1 Scott  staff        559 Aug 14 16:06 test.py
-rw-r--r--   1 Scott  staff        800 Aug 14 15:41 test2.py

chinese.misc.distsim.crf.ser.gz and chinesePCFG.ser.gz were extracted from stanford-chinese-corenlp-2017-06-09-models.

Code

test.py

# -*- coding: utf-8 -*-

from nltk.tokenize import StanfordSegmenter

segmenter = StanfordSegmenter(
    path_to_jar='./stanford-segmenter-2017-06-09/stanford-segmenter-3.8.0.jar',
    path_to_slf4j='./stanford-parser-full-2017-06-09/slf4j-api.jar',
    path_to_sihan_corpora_dict='./stanford-segmenter-2017-06-09/data',
    path_to_model='./stanford-segmenter-2017-06-09/data/pku.gz',
    path_to_dict='./stanford-segmenter-2017-06-09/data/dict-chris6.ser.gz')

res = segmenter.segment('北海已成为中国对外开放中升起的一颗明星')
print(res)

Error

$ python test.py
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    res = segmenter.segment('北海已成为中国对外开放中升起的一颗明星')
  File "/usr/local/var/pyenv/versions/3.6.1/lib/python3.6/site-packages/nltk/tokenize/stanford_segmenter.py", line 164, in segment
    return self.segment_sents([tokens])
  File "/usr/local/var/pyenv/versions/3.6.1/lib/python3.6/site-packages/nltk/tokenize/stanford_segmenter.py", line 192, in segment_sents
    stdout = self._execute(cmd)
  File "/usr/local/var/pyenv/versions/3.6.1/lib/python3.6/site-packages/nltk/tokenize/stanford_segmenter.py", line 211, in _execute
    stdout, _stderr = java(cmd, classpath=self._stanford_jar, stdout=PIPE, stderr=PIPE)
  File "/usr/local/var/pyenv/versions/3.6.1/lib/python3.6/site-packages/nltk/internals.py", line 129, in java
    p = subprocess.Popen(cmd, stdin=stdin, stdout=stdout, stderr=stderr)
  File "/usr/local/var/pyenv/versions/3.6.1/lib/python3.6/subprocess.py", line 707, in __init__
    restore_signals, start_new_session)
  File "/usr/local/var/pyenv/versions/3.6.1/lib/python3.6/subprocess.py", line 1260, in _execute_child
    restore_signals, start_new_session, preexec_fn)
TypeError: expected str, bytes or os.PathLike object, not NoneType

Python 2.7.12 (test2.py)

$ python test2.py
Traceback (most recent call last):
  File "test2.py", line 12, in <module>
    res = segmenter.segment(u'北海已成为中国对外开放中升起的一颗明星')
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/site-packages/nltk/tokenize/stanford_segmenter.py", line 164, in segment
    return self.segment_sents([tokens])
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/site-packages/nltk/tokenize/stanford_segmenter.py", line 192, in segment_sents
    stdout = self._execute(cmd)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/site-packages/nltk/tokenize/stanford_segmenter.py", line 211, in _execute
    stdout, _stderr = java(cmd, classpath=self._stanford_jar, stdout=PIPE, stderr=PIPE)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/site-packages/nltk/internals.py", line 129, in java
    p = subprocess.Popen(cmd, stdin=stdin, stdout=stdout, stderr=stderr)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
TypeError: execv() arg 2 must contain only strings

I don't know what happened. I tested it on Python 3.6.1 and Python 2.7.12, and it doesn't work on either. Please help!
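From the traceback it looks like one of the strings in the java command handed to subprocess.Popen is None. A quick Python 3 check of the inputs (just a diagnostic sketch, nothing NLTK-specific):

import os
import shutil

# Every path handed to StanfordSegmenter in test.py, checked for existence.
paths = {
    'path_to_jar': './stanford-segmenter-2017-06-09/stanford-segmenter-3.8.0.jar',
    'path_to_slf4j': './stanford-parser-full-2017-06-09/slf4j-api.jar',
    'path_to_sihan_corpora_dict': './stanford-segmenter-2017-06-09/data',
    'path_to_model': './stanford-segmenter-2017-06-09/data/pku.gz',
    'path_to_dict': './stanford-segmenter-2017-06-09/data/dict-chris6.ser.gz',
}
for name, path in paths.items():
    print(name, os.path.exists(path))            # all of these should be True

print('java binary:', shutil.which('java'))      # should not be None
print('JAVAHOME:', os.environ.get('JAVAHOME'))   # often consulted when locating the java binary
print('JAVA_HOME:', os.environ.get('JAVA_HOME'))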

alvations commented Aug 14, 2017

@scottming Please take a look at the answer on https://stackoverflow.com/questions/45663121/about-stanford-word-segmenter/45668849

The old StanfordSegmenter code will be deprecated soon; it is advisable to use the new interface, which removes the arcane setup needed to configure the environment for the Stanford tools.

First, start the server:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2016-10-31-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties 

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-chinese.properties \
-preload tokenize,ssplit,pos,lemma,ner,parse \
-status_port 9001  -port 9001 -timeout 15000
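
To confirm the server is actually up before wiring it to NLTK, you can hit its HTTP endpoint directly (a minimal sketch assuming the port 9001 used above; requests is a separate install):

import json
import requests

# POST raw text to the CoreNLP server with the desired annotators;
# a 200 response with JSON sentences means the server is ready.
props = {'annotators': 'tokenize,ssplit', 'outputFormat': 'json'}
resp = requests.post('http://localhost:9001/',
                     params={'properties': json.dumps(props)},
                     data=u'我家没有电脑。'.encode('utf-8'))
print(resp.status_code)                              # expect 200
print([t['word'] for t in resp.json()['sentences'][0]['tokens']])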

With NLTK v3.2.4:

>>> from nltk.parse.corenlp import CoreNLPParser 
>>> corenlp_parser = CoreNLPParser('http://localhost:9001', encoding='utf8')
>>> text = u'我家没有电脑。'
>>> result = corenlp_parser.api_call(text, {'annotators': 'tokenize,ssplit'})
>>> tokens = [token['originalText'] or token['word'] for sentence in result['sentences'] for token in sentence['tokens']]
>>> tokens
['我家', '没有', '电脑', '。']

The new interface in the upcoming release v3.2.5 will be much simpler:

# After starting the CoreNLP server, in Python, do this:
>>> from nltk.tokenize.stanford import CoreNLPTokenizer
>>> sttok = CoreNLPTokenizer('http://localhost:9001')
>>> sttok.tokenize(u'我家没有电脑。')
['我家', '没有', '电脑', '。']

alvations added this to the 3.2.5 milestone Aug 14, 2017

scottming commented Aug 14, 2017

Thanks, it's awesome. When can I install v3.2.5? Can I use that API to do part-of-speech tagging or dependency parsing with v3.2.4? Could you give me some more references?

alvations commented Aug 14, 2017

v3.2.5 will be released in a couple of weeks (fingers-crossed) =)

For now, please refer to #1735 (comment) for the details. There's POS and NER tagging information there.
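
In the meantime, here is a rough sketch of pulling POS and NER tags through the same CoreNLP HTTP server with the api_call approach shown above (an illustration only, not the interface described in #1735; the output is roughly what to expect and tags may differ):

>>> from nltk.parse.corenlp import CoreNLPParser
>>> corenlp_parser = CoreNLPParser('http://localhost:9001', encoding='utf8')
>>> result = corenlp_parser.api_call(u'我家没有电脑。', {'annotators': 'tokenize,ssplit,pos,ner'})
>>> # Each token dict carries 'pos' and 'ner' fields once those annotators have run.
>>> [(token['word'], token['pos'], token['ner'])
...  for sentence in result['sentences'] for token in sentence['tokens']]
[('我家', 'NN', 'O'), ('没有', 'VE', 'O'), ('电脑', 'NN', 'O'), ('。', 'PU', 'O')]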

For a dependency parsing example, please see https://github.com/nltk/nltk/blob/develop/nltk/parse/corenlp.py#L495. But I haven't tried other languages with the parser models; maybe @dimazest has a better idea on that.


The interface for Chinese CoreNLP looks almost the same, using the CoreNLPDependencyParser API:

# After starting the CoreNLP server on the terminal, in Python, do this:
>>> from nltk.parse.corenlp import CoreNLPDependencyParser
>>> st = CoreNLPDependencyParser('http://localhost:9001', encoding='utf8')

>>> st.raw_parse(u'我家没有电脑。')
<list_iterator object at 0x10fdb0ac8>

>>> parses = st.raw_parse(u'我家没有电脑。')
>>> type(next(parses))
<class 'nltk.parse.dependencygraph.DependencyGraph'>

>>> parses = st.raw_parse(u'我家没有电脑。')
>>> print(next(parses).to_conll(4))
我家	NN	2	dep
没有	VE	0	ROOT
电脑	NN	2	dobj
。	PU	2	punct


>>> print(next(parses))
defaultdict(<function DependencyGraph.__init__.<locals>.<lambda> at 0x10598b598>,
            {0: {'address': 0,
                 'ctag': 'TOP',
                 'deps': defaultdict(<class 'list'>, {'ROOT': [2]}),
                 'feats': None,
                 'head': None,
                 'lemma': None,
                 'rel': None,
                 'tag': 'TOP',
                 'word': None},
             1: {'address': 1,
                 'ctag': 'NN',
                 'deps': defaultdict(<class 'list'>, {}),
                 'feats': '_',
                 'head': 2,
                 'lemma': '我家',
                 'rel': 'dep',
                 'tag': 'NN',
                 'word': '我家'},
             2: {'address': 2,
                 'ctag': 'VE',
                 'deps': defaultdict(<class 'list'>,
                                     {'dep': [1],
                                      'dobj': [3],
                                      'punct': [4]}),
                 'feats': '_',
                 'head': 0,
                 'lemma': '没有',
                 'rel': 'ROOT',
                 'tag': 'VE',
                 'word': '没有'},
             3: {'address': 3,
                 'ctag': 'NN',
                 'deps': defaultdict(<class 'list'>, {}),
                 'feats': '_',
                 'head': 2,
                 'lemma': '电脑',
                 'rel': 'dobj',
                 'tag': 'NN',
                 'word': '电脑'},
             4: {'address': 4,
                 'ctag': 'PU',
                 'deps': defaultdict(<class 'list'>, {}),
                 'feats': '_',
                 'head': 2,
                 'lemma': '。',
                 'rel': 'punct',
                 'tag': 'PU',
                 'word': '。'}})
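
If a flatter view is more convenient, the same DependencyGraph can also be walked as (governor, relation, dependent) triples via its triples() method (a small sketch; the output shown is approximate):

>>> parses = st.raw_parse(u'我家没有电脑。')
>>> # Each triple is ((governor_word, tag), relation, (dependent_word, tag)).
>>> for (gov, gov_tag), rel, (dep, dep_tag) in next(parses).triples():
...     print(gov, rel, dep)
...
没有 dep 我家
没有 dobj 电脑
没有 punct 。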

scottming closed this Aug 14, 2017
