Porting your code to NLTK 3.0
NLTK 3.0 contains a number of interface changes. These are being incorporated into a new version of the NLTK book, updated for Python 3 and NLTK 3.
NLTK's handling of unicode has changed: NLTK 3 requires all text input to be unicode and always returns text as unicode. Previously, some functions and classes worked on unicode and others required encoded bytestrings. Make sure you pass unicode to NLTK and expect unicode output from it; existing code that assumes bytestrings may start to fail.
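One way to meet the unicode requirement is to decode bytestrings once, at the point where data enters the program, and pass only text onwards. The sketch below assumes Python 3 and UTF-8 input; `to_text` is a hypothetical helper, not part of NLTK, and `str.split()` merely stands in for an NLTK call that expects text.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it first if it is a bytestring.

    Hypothetical helper (not an NLTK function); the UTF-8 default is an
    assumption and should match the real encoding of your data.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# UTF-8 encoded bytes, e.g. read from a file opened in binary mode.
raw = b"caf\xc3\xa9 au lait"

# Decode once at the boundary, then work with text everywhere else.
text = to_text(raw)

# Stand-in for an NLTK call that expects (and returns) unicode text.
tokens = text.split()
```

Values that are already text pass through unchanged, so the helper can be applied to inputs of either kind.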
Here are some changes you may need to make:
- grammar: ContextFreeGrammar → CFG, WeightedGrammar → PCFG, StatisticalDependencyGrammar → ProbabilisticDependencyGrammar, WeightedProduction → ProbabilisticProduction
- draw.tree: TreeSegmentWidget.node() → TreeSegmentWidget.label(), TreeSegmentWidget.set_node() → TreeSegmentWidget.set_label()
- parsers: nbest_parse() → parse()
- ccg.parse.chart: EdgeI.next() → EdgeI.nextsym()
- Chunk parser: top_node → root_label; chunk_node → chunk_label
- WordNet properties are now access methods, e.g. Synset.definition → Synset.definition()
- sem.relextract: mk_pairs() → _tree2semi_rel(), mk_reldicts() → semi_rel2reldict(), show_clause() → clause(), show_raw_rtuple() → rtuple()
- corpusname.tagged_words(simplify_tags=True) → corpusname.tagged_words(tagset='universal')
- util.clean_html() → BeautifulSoup.get_text(). clean_html() has been dropped; install and use BeautifulSoup or some other HTML parser instead.
- util.ibigrams() → util.bigrams()
- util.ingrams() → util.ngrams()
- util.itrigrams() → util.trigrams()
- metrics.windowdiff → metrics.segmentation.windowdiff(); metrics.windowdiff.demo() was removed.
- parse.generate2 was re-written and merged into parse.generate
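For the clean_html() item, the list above recommends BeautifulSoup. Where adding a dependency is not an option, the standard library's html.parser can serve as the "other HTML parser". The following is a minimal sketch under that assumption; `TextExtractor` and `strip_html` are hypothetical names, and the output is not guaranteed to match what the old clean_html() produced.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the collected chunks and collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello, <b>NLTK</b> 3!</p><script>var x = 1;</script></body></html>"
)
plain = strip_html(sample)
```

With BeautifulSoup installed, calling get_text() on the parsed document plays the same role and copes better with malformed markup.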
Creating objects from strings:
- Many objects now support a fromstring() method
- tree.Tree.parse() → tree.Tree.fromstring()
- tree.Tree() → tree.Tree.fromstring()
- chunk.RegexpChunkRule.parse() → chunk.RegexpChunkRule.fromstring()
- grammar.parse_cfg() → CFG.fromstring() (same for other types of grammar)
- sem.LogicParser.parse() → sem.Expression.fromstring()
- sem.DrtParser.parse() → sem.DrtExpression.fromstring()
- sem.parse_valuation() → sem.Valuation.fromstring()
- sem.parse_type() → sem.Type.fromstring()
Operations on lists of sentences or other items:
- tokenize.batch_tokenize() → tokenize.tokenize_sents()
- tag.batch_tag() → tag.tag_sents()
- parse.batch_parse() → parse.parse_sents()
- classify.batch_classify() → classify.classify_many()
- sem.batch_interpret() → sem.interpret_sents()
- sem.batch_evaluate() → sem.evaluate_sents()
- chunk.batch_ne_chunk() → chunk.ne_chunk_sents()
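Code that has to run against both NLTK 2 and NLTK 3 during a migration can look up the new method name first and fall back to the old one. The sketch below shows this for the batch_tag() → tag_sents() rename; `tag_sentences` is a hypothetical helper, and the two tagger classes are stand-ins written for this example rather than real NLTK taggers.

```python
def tag_sentences(tagger, sentences):
    """Tag a list of tokenized sentences under either method name.

    Prefers the NLTK 3 name tag_sents() and falls back to the NLTK 2
    name batch_tag(). Hypothetical helper, not part of NLTK.
    """
    method = getattr(tagger, "tag_sents", None)
    if method is None:
        method = getattr(tagger, "batch_tag")
    return method(sentences)


class OldStyleTagger:
    """Stand-in exposing only the NLTK 2 method name."""

    def batch_tag(self, sentences):
        return [[(word, "X") for word in sent] for sent in sentences]


class NewStyleTagger:
    """Stand-in exposing only the NLTK 3 method name."""

    def tag_sents(self, sentences):
        return [[(word, "X") for word in sent] for sent in sentences]


sentences = [["Hello", "world"], ["NLTK", "3"]]
old_result = tag_sentences(OldStyleTagger(), sentences)
new_result = tag_sentences(NewStyleTagger(), sentences)
```

The same lookup pattern applies to the other renames in this list. Once NLTK 2 support is no longer needed, calling the new name directly is simpler.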
Changes in probability.FreqDist:
- fdist.keys() → sorted(fdist)
- fdist.inc(x) → fdist[x] += 1
- fdist.samples() → sorted(fdist.keys())
- fdist.Nr(r) → fdist.Nr()[r]
- fdist.Nr_nonzero() → fdist.Nr().items()
- cfdist.conditions() → sorted(cfdist.conditions())
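The first three replacements can be tried without NLTK installed, because FreqDist in NLTK 3 is built on collections.Counter. The sketch below uses a plain Counter as a stand-in for FreqDist; the sample sentence is made up for illustration.

```python
from collections import Counter

words = "the cat sat on the mat the end".split()

# Stand-in for nltk.FreqDist, which subclasses Counter in NLTK 3.
fdist = Counter()
for word in words:
    fdist[word] += 1          # replaces fdist.inc(word)

# Replaces fdist.keys() and fdist.samples(); the result is sorted by sample.
samples = sorted(fdist)

# Counter also offers frequency-ordered access.
top = fdist.most_common(1)
```

Note that sorted(fdist) orders samples by value, not by frequency. If existing code relied on the order in which the old keys() returned samples, most_common() may be the closer match; check this against the behaviour your code expects.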
Porter stemmer changes:
- adjust_case(), cons(), cvc(), doublec(), m(), step1ab(), step1c(), step2(), step3(), step4(), step5(), vowelinstem() made private
- ends(), r(), setto() removed
Removed modules, classes and functions:
- classify.svm was removed. For classification based on support vector machines (SVMs) use classify.scikitlearn or scikit-learn directly. See https://github.com/nltk/nltk/issues/450.
- probability.GoodTuringProbDist class was removed. See https://github.com/nltk/nltk/issues/381.
- HiddenMarkovModelTaggerTransformI and its subclasses were removed. See https://github.com/nltk/nltk/issues/374.
- classify.maxent no longer supports algorithms backed by scipy.maxentropy. See https://github.com/nltk/nltk/issues/321.
- misc.babelfish was removed. See https://github.com/nltk/nltk/issues/265.
- sourcedstring was removed. See https://github.com/nltk/nltk/issues/322.
- yamltags was removed. JSON is now preferred instead. See https://github.com/nltk/nltk/issues/540.
- mallet was removed, including the tag.crf module. See https://github.com/nltk/nltk/issues/104.
- tag.simplify was removed. See https://github.com/nltk/nltk/issues/483.
- model was removed. See https://github.com/nltk/nltk/issues?labels=model.
- corpus.reader.wordnet._lcs_by_depth was removed. See https://github.com/nltk/nltk/issues/422.
Miscellaneous changes:
- probability.ConditionalProbDist.default_factory now inherits from dict instead of defaultdict
- probability.ConditionalProbDistI.default_factory now inherits from dict instead of defaultdict
- probability.DictionaryConditionalProbDist.default_factory now inherits from dict instead of defaultdict
- tag.senna.SennaTagger → classify.Senna
- tag.senna.POSTagger → tag.SennaTagger
- tag.senna.CHKTagger → tag.SennaChunkTagger
Printing changes (from 3.0.2, see https://github.com/nltk/nltk/issues/804):
- classify.decisiontree.DecisionTreeClassifier.pp → pretty_format
- metrics.confusionmatrix.ConfusionMatrix.pp → pretty_format
- sem.lfg.FStructure.pprint → pretty_format
- sem.drt.DrtExpression.pretty → pretty_format
- parse.chart.Chart.pp → pretty_format
- Tree.pprint() → pformat
- FreqDist.pprint → pformat
- Tree.pretty_print → pprint
- Tree.pprint_latex_qtree → pformat_latex_qtree
Environment variables for third-party software:
- These have been normalised; please see Installing Third Party Software
More background on Python 3 and NLTK 3: