Building own classifier based POS tagger using SklearnClassifier and ClassifierBasedPOSTagger #2781
This is a consequence of (what I believe to be) a bug in scikit-learn. Let me show you:
{
    "prevtag": prevtag,
    "prevprevtag": prevprevtag,
    "word": word,
    "word.lower": word.lower(),
    "suffix3": word.lower()[-3:],
    "suffix2": word.lower()[-2:],
    "suffix1": word.lower()[-1:],
    "prevprevword": prevprevword,
    "prevword": prevword,
    "prevtag+word": f"{prevtag}+{word.lower()}",
    "prevprevtag+word": f"{prevprevtag}+{word.lower()}",
    "prevword+word": f"{prevword}+{word.lower()}",
    "shape": shape,
}

Now, in some situations, such as for the first word in the sentence, the previous tag is simply nonexistent, i.e. None. Take this sentence:

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

Then the features for the first word are:

{'prevprevtag': None,
'prevprevtag+word': 'None+pierre',
'prevprevword': None,
'prevtag': None,
'prevtag+word': 'None+pierre',
'prevword': None,
'prevword+word': 'None+pierre',
'shape': 'upcase',
'suffix1': 'e',
'suffix2': 're',
'suffix3': 'rre',
'word': 'Pierre',
'word.lower': 'pierre'}

So, this mapping from feature names to feature values has an expected type of [...]. This is passed along the chain, to [...]. Line 227 is responsible for handling the case where the feature value is [...].

In short: scikit-learn's [...].

My advice is to report this as an issue over at https://github.com/scikit-learn/scikit-learn/issues. You can link to this message if you wish. Perhaps in the meantime you can use the following snippet:

class CustomClassifierBasedPOSTagger(ClassifierBasedPOSTagger):
    def feature_detector(self, tokens, index, history):
        return {
            key: str(value)  # Ensure that the feature value is a string. Converts None to 'None'
            for key, value in super().feature_detector(tokens, index, history).items()
        }

This will prevent feature values of None.

I don't have the time to actually train something bigger for you, so I quickly trained this with just three sentences:

import nltk
from nltk.corpus import treebank
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB
from nltk.tag.sequential import ClassifierBasedPOSTagger
nltk.download('treebank')
data = treebank.tagged_sents()
train_data = data[:3]
test_data = data[3:]

class CustomClassifierBasedPOSTagger(ClassifierBasedPOSTagger):
    def feature_detector(self, tokens, index, history):
        return {
            key: str(value)  # Ensure that the feature value is a string. Converts None to 'None'
            for key, value in super().feature_detector(tokens, index, history).items()
        }

bnb = SklearnClassifier(BernoulliNB())
bnb_tagger = CustomClassifierBasedPOSTagger(train=train_data,
                                            classifier_builder=bnb.train,
                                            verbose=True)
sentence = "This is a sample sentence which I just made for fun."
# evaluate tagger on test data and sample sentence
print(bnb_tagger.evaluate(test_data))
# see results on our previously defined sentence
print(bnb_tagger.tag(nltk.word_tokenize(sentence)))

Which outputs:

[nltk_data] Downloading package treebank to C:\Users\Tom/nltk_data...
[nltk_data] Package treebank is already up-to-date!
Constructing training corpus for classifier.
Training classifier (58 instances)
0.09338289371682999
[('This', 'NNP'), ('is', 'NNP'), ('a', 'NNP'), ('sample', 'NNP'), ('sentence', 'NNP'), ('which', 'NNP'), ('I', 'NNP'), ('just', 'NNP'), ('made', 'NNP'), ('for', 'NNP'), ('fun', 'NNP'), ('.', 'NNP')]

(Obviously this evaluates horribly, but the point is that it doesn't break anymore.)

Happy tagging,
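
For anyone who wants to see what the str() conversion in the workaround does without training a tagger, here is a minimal sketch on a plain dict shaped like the 'Pierre' feature set shown earlier. The helper name stringify_features is hypothetical and not part of NLTK or scikit-learn; it only mirrors the dict comprehension from the CustomClassifierBasedPOSTagger snippet.

```python
# Hypothetical helper (not part of NLTK or scikit-learn): mirrors the dict
# comprehension used in the CustomClassifierBasedPOSTagger workaround.
def stringify_features(features):
    """Return a copy of the feature dict with every value converted to str."""
    return {key: str(value) for key, value in features.items()}


# A few of the features shown earlier for the first word, 'Pierre'.
raw_features = {
    "prevtag": None,
    "prevword": None,
    "prevtag+word": "None+pierre",
    "word": "Pierre",
    "suffix3": "rre",
    "shape": "upcase",
}

clean_features = stringify_features(raw_features)

# Every value is now a string, so the None object is never handed to scikit-learn.
assert all(isinstance(value, str) for value in clean_features.values())
print(clean_features["prevtag"])  # the string 'None', not the None object
```

Note that this makes a missing previous tag look like an ordinary categorical value 'None', which matches how the combined features such as 'prevtag+word' already render it ('None+pierre').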
@larsmans – in case you have any views on this question of a possible bug in scikit-learn.
I'm trying to build my own classifier-based POS tagger using SklearnClassifier and ClassifierBasedPOSTagger. The code that I've tried is given below.

[...]

This code is yielding the following error:

[...]

How to do it right?