
Building own classifier based POS tagger using SklearnClassifier and ClassifierBasedPOSTagger #2781

abdalimran opened this issue Aug 11, 2021 · 2 comments


@abdalimran

I'm trying to build my own classifier-based POS tagger using SklearnClassifier and ClassifierBasedPOSTagger. The code I've tried is given below.

import nltk
from nltk.corpus import treebank
from nltk.classify import SklearnClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger
from sklearn.naive_bayes import BernoulliNB

nltk.download('treebank')

data = treebank.tagged_sents()
train_data = data[:3500]
test_data = data[3500:]

bnb = SklearnClassifier(BernoulliNB())
bnb_tagger = ClassifierBasedPOSTagger(train=train_data,
                                      classifier_builder=bnb.train)

# evaluate tagger on test data and sample sentence
print(bnb_tagger.evaluate(test_data))

# see results on our previously defined sentence
print(bnb_tagger.tag(nltk.word_tokenize(sentence)))

This code is yielding the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Users\ABDULL~1.IMR\AppData\Local\Temp/ipykernel_6580/266992580.py in <module>
      4 
      5 bnb = SklearnClassifier(BernoulliNB())
----> 6 bnb_tagger = ClassifierBasedPOSTagger(train=train_data,
      7                                       classifier_builder=bnb.train)
      8 

~\Miniconda3\envs\nlp_course\lib\site-packages\nltk\tag\sequential.py in __init__(self, feature_detector, train, classifier_builder, classifier, backoff, cutoff_prob, verbose)
    637 
    638         if train:
--> 639             self._train(train, classifier_builder, verbose)
    640 
    641     def choose_tag(self, tokens, index, history):

~\Miniconda3\envs\nlp_course\lib\site-packages\nltk\tag\sequential.py in _train(self, tagged_corpus, classifier_builder, verbose)
    673         if verbose:
    674             print("Training classifier ({} instances)".format(len(classifier_corpus)))
--> 675         self._classifier = classifier_builder(classifier_corpus)
    676 
    677     def __repr__(self):

~\Miniconda3\envs\nlp_course\lib\site-packages\nltk\classify\scikitlearn.py in train(self, labeled_featuresets)
    110 
    111         X, y = list(zip(*labeled_featuresets))
--> 112         X = self._vectorizer.fit_transform(X)
    113         y = self._encoder.fit_transform(y)
    114         self._clf.fit(X, y)

~\Miniconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\_dict_vectorizer.py in fit_transform(self, X, y)
    288             Feature vectors; always 2-d.
    289         """
--> 290         return self._transform(X, fitting=True)
    291 
    292     def inverse_transform(self, X, dict_type=dict):

~\Miniconda3\envs\nlp_course\lib\site-packages\sklearn\feature_extraction\_dict_vectorizer.py in _transform(self, X, fitting)
    233                     if feature_name in vocab:
    234                         indices.append(vocab[feature_name])
--> 235                         values.append(self.dtype(v))
    236 
    237             indptr.append(len(indices))

TypeError: float() argument must be a string or a number, not 'NoneType'

How do I do this correctly?

@tomaarsen
Member

This is a consequence of (what I believe to be) a bug in scikit-learn. Let me show you:

ClassifierBasedPOSTagger (and its superclass ClassifierBasedTagger) uses a feature_detector method which, given some parameters (tokens, index, history), produces a dictionary like the following:

{
    "prevtag": prevtag,
    "prevprevtag": prevprevtag,
    "word": word,
    "word.lower": word.lower(),
    "suffix3": word.lower()[-3:],
    "suffix2": word.lower()[-2:],
    "suffix1": word.lower()[-1:],
    "prevprevword": prevprevword,
    "prevword": prevword,
    "prevtag+word": f"{prevtag}+{word.lower()}",
    "prevprevtag+word": f"{prevprevtag}+{word.lower()}",
    "prevword+word": f"{prevword}+{word.lower()}",
    "shape": shape,
}

Now, in some situations, such as for the first word in the sentence, the previous tag is simply nonexistent, i.e. None. For example, if we take the sample training sentence:

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

Then ClassifierBasedPOSTagger will, in _train, call feature_detector, producing:

{'prevprevtag': None,
 'prevprevtag+word': 'None+pierre',
 'prevprevword': None,
 'prevtag': None,
 'prevtag+word': 'None+pierre',
 'prevword': None,
 'prevword+word': 'None+pierre',
 'shape': 'upcase',
 'suffix1': 'e',
 'suffix2': 're',
 'suffix3': 'rre',
 'word': 'Pierre',
 'word.lower': 'pierre'}

So, this mapping from feature names to feature values has an expected type of Dict[str, Optional[str]].

This is passed along the chain to sklearn\feature_extraction\_dict_vectorizer.py, specifically to the _transform method of DictVectorizer. With x as a feature dict like the last example, the following section of code is run:
https://github.com/scikit-learn/scikit-learn/blob/e64714637d8cc9f4724ae21ea500e4bdc57b0a39/sklearn/feature_extraction/_dict_vectorizer.py#L223-L255

Line 227 is responsible for handling the case where the feature value is None, setting the feature name to e.g. "prevprevtag". However, line 255 then casts our feature value to the dtype (which defaults to np.float64), producing np.float64(None). This throws the TypeError you're experiencing.


In short: scikit-learn's _transform method of DictVectorizer in sklearn/feature_extraction/_dict_vectorizer.py fails when the input argument X contains mappings to None.
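
If it helps, the failure can be reproduced without NLTK at all. Here is a minimal sketch against the scikit-learn version from the traceback above; the feature dict and its keys are just illustrative:

from sklearn.feature_extraction import DictVectorizer

# A feature dict like the one feature_detector produces for a sentence-initial
# word: the string value is one-hot encoded, but the None value falls through
# to the dtype cast and triggers float(None).
features = [{"word": "Pierre", "prevtag": None}]

DictVectorizer().fit_transform(features)
# TypeError: float() argument must be a string or a number, not 'NoneType'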


My advice is to report this as an issue over at https://github.com/scikit-learn/scikit-learn/issues. You can link to this message if you wish.

Perhaps in the meantime you can use the following snippet:

class CustomClassifierBasedPOSTagger(ClassifierBasedPOSTagger):

    def feature_detector(self, tokens, index, history):
        return {
            key: str(value) # Ensure that the feature value is a string. Converts None to 'None'
            for key, value in super().feature_detector(tokens, index, history).items()
        }

This will prevent feature values of None while the bug persists; it uses the string 'None' instead. Alternatively, you can pick whatever token you want to use in place of None, e.g. key: value if value is not None else 'my_custom_value_which_represents_none'.
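
For example, here is a variant of the same subclass that only swaps None for an explicit sentinel and leaves every other feature value untouched (the "<NONE>" token is an arbitrary choice, not something NLTK or scikit-learn prescribe):

class SentinelClassifierBasedPOSTagger(ClassifierBasedPOSTagger):

    def feature_detector(self, tokens, index, history):
        return {
            # Keep real feature values as they are; only replace None with a sentinel string
            key: value if value is not None else "<NONE>"
            for key, value in super().feature_detector(tokens, index, history).items()
        }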

I don't have the time to actually train something bigger for you, so I quickly trained this with just three sentences:

import nltk
from nltk.corpus import treebank

from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB
from nltk.tag.sequential import ClassifierBasedPOSTagger

nltk.download('treebank')

data = treebank.tagged_sents()
train_data = data[:3]
test_data = data[3:]

class CustomClassifierBasedPOSTagger(ClassifierBasedPOSTagger):

    def feature_detector(self, tokens, index, history):
        return {
            key: str(value) # Ensure that the feature value is a string. Converts None to 'None'
            for key, value in super().feature_detector(tokens, index, history).items()
        }

bnb = SklearnClassifier(BernoulliNB())
bnb_tagger = CustomClassifierBasedPOSTagger(train=train_data,
                                            classifier_builder=bnb.train,
                                            verbose=True)

sentence = "This is a sample sentence which I just made for fun."
# evaluate tagger on test data and sample sentence
print(bnb_tagger.evaluate(test_data))

# see results on our previously defined sentence
print(bnb_tagger.tag(nltk.word_tokenize(sentence)))

Which outputs:

[nltk_data] Downloading package treebank to C:\Users\Tom/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
Constructing training corpus for classifier.
Training classifier (58 instances)
0.09338289371682999
[('This', 'NNP'), ('is', 'NNP'), ('a', 'NNP'), ('sample', 'NNP'), ('sentence', 'NNP'), ('which', 'NNP'), ('I', 'NNP'), ('just', 'NNP'), ('made', 'NNP'), ('for', 'NNP'), ('fun', 'NNP'), ('.', 'NNP')]

(Obviously this evaluates horribly, but the point is that it doesn't break anymore)
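
Presumably the same workaround holds up with the full split from your original snippet; I have not trained or scored that here, so treat this as an untested sketch:

# Same workaround, but with the original 3500-sentence training split
train_data = data[:3500]
test_data = data[3500:]

bnb_tagger = CustomClassifierBasedPOSTagger(train=train_data,
                                            classifier_builder=SklearnClassifier(BernoulliNB()).train,
                                            verbose=True)
print(bnb_tagger.evaluate(test_data))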

Happy tagging,

  • Tom Aarsen

@stevenbird
Member

@larsmans – in case you have any views on this question of a possible bug in scikit-learn
