Porting your code to NLTK 3.0

NLTK 3.0 contains a number of interface changes. These are being incorporated into a new version of the NLTK book, updated for Python 3 and NLTK 3.

NLTK 3 changes how it handles unicode: all text input must be unicode, and text is always returned as unicode. Previously, some functions and classes worked on unicode while others required encoded bytestrings. Make sure you pass unicode to NLTK and expect unicode output from it; existing code that assumes bytestrings may start to fail.
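
One way to meet this requirement is to decode bytestrings once, at the edge of your program, before they reach NLTK. The sketch below assumes Python 3 and UTF-8 input; the helper name to_text is hypothetical and not part of NLTK, and the split() call only stands in for a real tokenizer.

```python
def to_text(data, encoding="utf-8"):
    """Return `data` as a unicode string, decoding bytes if necessary."""
    if isinstance(data, bytes):
        # Assumption: the bytes are encoded as `encoding` (UTF-8 by default).
        return data.decode(encoding)
    return data


raw = b"caf\xc3\xa9 au lait"   # bytes, e.g. read from a file opened in binary mode
text = to_text(raw)            # unicode string, safe to hand to NLTK
tokens = text.split()          # stand-in for an NLTK tokenizer call
```

Strings that are already unicode pass through unchanged, so the helper can be applied to every input without checking its type first.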

Here are some changes you may need to make:

grammar: ContextFreeGrammar → CFG, WeightedGrammar → PCFG, StatisticalDependencyGrammar → ProbabilisticDependencyGrammar, WeightedProduction → ProbabilisticProduction
draw.tree: TreeSegmentWidget.node() → TreeSegmentWidget.label(), TreeSegmentWidget.set_node() → TreeSegmentWidget.set_label()
parsers: nbest_parse() → parse()
ccg.parse.chart: EdgeI.next() → EdgeI.nextsym()
Chunk parser: top_node → root_label; chunk_node → chunk_label
WordNet properties are now access methods, e.g. Synset.definition → Synset.definition()
sem.relextract: mk_pairs() → _tree2semi_rel(), mk_reldicts() → semi_rel2reldict(), show_clause() → clause(), show_raw_rtuple() → rtuple()
corpusname.tagged_words(simplify_tags=True) → corpusname.tagged_words(tagset='universal')
util.clean_html() → BeautifulSoup.get_text(). clean_html() has been dropped; install and use BeautifulSoup or another HTML parser instead.
util.ibigrams() → util.bigrams()
util.ingrams() → util.ngrams()
util.itrigrams() → util.trigrams()
metrics.windowdiff → metrics.segmentation.windowdiff(); metrics.windowdiff.demo() was removed.
parse.generate2 was re-written and merged into parse.generate
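
For the clean_html() replacement, BeautifulSoup is the suggested route. Where adding a dependency is not an option, the standard library's html.parser can serve as the "other HTML parser". The following is only a rough sketch: the names TextExtractor and strip_html are hypothetical, and the output is not guaranteed to match what clean_html() used to produce.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup.
        return " ".join("".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML fragment as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


plain = strip_html("<p>Hello, <b>world</b>!</p><script>var x = 1;</script>")
```

With BeautifulSoup installed, the same job is done by building a soup object from the HTML and calling its get_text() method.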

Creating objects from strings:

Many objects now support a fromstring() method
tree.Tree.parse() → tree.Tree.fromstring()
tree.Tree() → tree.Tree.fromstring()
chunk.RegexpChunkRule.parse() → chunk.RegexpChunkRule.fromstring()
grammar.parse_cfg() → CFG.fromstring() (same for other types of grammar)
sem.LogicParser.parse() → sem.Expression.fromstring()
sem.DrtParser.parse() → sem.DrtExpression.fromstring()
sem.parse_valuation() → sem.Valuation.fromstring()
sem.parse_type() → sem.Type.fromstring()
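
As a brief illustration of the fromstring() pattern, the sketch below builds a tree and a grammar from strings. It assumes NLTK 3 is installed; the toy sentence and grammar are made up for the example.

```python
from nltk import CFG, Tree

# NLTK 2: Tree.parse("(S ...)") or Tree("(S ...)")
tree = Tree.fromstring("(S (NP I) (VP (V saw) (NP him)))")
root = tree.label()
words = tree.leaves()

# NLTK 2: grammar.parse_cfg("...")
grammar = CFG.fromstring("""
    S -> NP VP
    NP -> 'I' | 'him'
    VP -> V NP
    V -> 'saw'
""")
n_rules = len(grammar.productions())
```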

Operations on lists of sentences or other items:

tokenize.batch_tokenize() → tokenize.tokenize_sents()
tag.batch_tag() → tag.tag_sents()
parse.batch_parse() → parse.parse_sents()
classify.batch_classify() → classify.classify_many()
sem.batch_interpret() → sem.interpret_sents()
sem.batch_evaluate() → sem.evaluate_sents()
chunk.batch_ne_chunk() → chunk.ne_chunk_sents()
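
If the same code has to run on both NLTK 2 and NLTK 3 during a migration, one option is to look up the new method name first and fall back to the old one. This is a minimal sketch: tag_sentences and DummyTagger are hypothetical names, and the dummy class only stands in for a real NLTK tagger.

```python
def tag_sentences(tagger, sentences):
    """Tag a list of tokenized sentences on either NLTK 2 or NLTK 3."""
    # Try the NLTK 3 name first; fall back to the NLTK 2 name if it is missing.
    method = getattr(tagger, "tag_sents", None) or getattr(tagger, "batch_tag")
    return method(sentences)


class DummyTagger:
    """Stand-in for a real tagger, exposing only the NLTK 3 method."""

    def tag_sents(self, sentences):
        return [[(word, "X") for word in sent] for sent in sentences]


tagged = tag_sentences(DummyTagger(), [["Hello", "world"], ["Bye"]])
```

The same lookup works for the other renamed batch methods; once NLTK 2 support is dropped, the wrapper can be replaced by a direct call to the new name.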

Changes in probability.FreqDist:

fdist.keys() → sorted(fdist)
fdist.inc(x) → fdist[x] += 1
fdist.samples() → sorted(fdist.keys())
fdist.Nr(r) → fdist.Nr()[r]
fdist.Nr_nonzero() → fdist.Nr().items()
cfdist.conditions() → sorted(cfdist.conditions())
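
In NLTK 3, FreqDist behaves like a collections.Counter, so the replacement idioms above can be tried with the standard library alone. The sketch below uses Counter as a stand-in for FreqDist, which is an assumption made to keep the example free of dependencies.

```python
from collections import Counter

# Stand-in for nltk.FreqDist (assumption: it follows the Counter interface).
fdist = Counter()
for word in "the cat sat on the mat the end".split():
    fdist[word] += 1             # was: fdist.inc(word)

samples = sorted(fdist)          # was: fdist.keys() / fdist.samples()
top = fdist.most_common(1)[0]    # most frequent sample together with its count
```

Note that sorted(fdist) orders the samples themselves, not their counts; most_common() is the Counter way to get samples ordered by count.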

Porter stemmer changes:

adjust_case(), cons(), cvc(), doublec(), m(), step1ab(), step1c(), step2(), step3(), step4(), step5(), vowelinstem() made private
ends(), r(), setto() removed

Removed modules, classes and functions:

classify.svm was removed. For classification based on support vector machines (SVMs) use classify.scikitlearn or scikit-learn directly. See https://github.com/nltk/nltk/issues/450.
probability.GoodTuringProbDist class was removed. See https://github.com/nltk/nltk/issues/381.
HiddenMarkovModelTaggerTransformI and its subclasses are removed. See https://github.com/nltk/nltk/issues/374.
classify.maxent no longer supports algorithms backed by scipy.maxentropy. See https://github.com/nltk/nltk/issues/321.
misc.babelfish was removed. See https://github.com/nltk/nltk/issues/265.
sourcedstring was removed. See https://github.com/nltk/nltk/issues/322.
yamltags was removed; JSON is now preferred. See https://github.com/nltk/nltk/issues/540
mallet was removed, including the tag.crf module. See https://github.com/nltk/nltk/issues/104
tag.simplify was removed. See https://github.com/nltk/nltk/issues/483
model was removed. See https://github.com/nltk/nltk/issues?labels=model
corpus.reader.wordnet._lcs_by_depth was removed. See https://github.com/nltk/nltk/issues/422.

Miscellaneous changes:

probability.ConditionalProbDist.default_factory now inherits from dict instead of defaultdict
probability.ConditionalProbDistI.default_factory now inherits from dict instead of defaultdict
probability.DictionaryConditionalProbDist.default_factory now inherits from dict instead of defaultdict
tag.senna.SennaTagger → classify.Senna
tag.senna.POSTagger → tag.SennaTagger
tag.senna.CHKTagger → tag.SennaChunkTagger

Printing changes (from 3.0.2, see https://github.com/nltk/nltk/issues/804):

classify.decisiontree.DecisionTreeClassifier.pp → pretty_format
metrics.confusionmatrix.ConfusionMatrix.pp → pretty_format
sem.lfg.FStructure.pprint → pretty_format
sem.drt.DrtExpression.pretty → pretty_format
parse.chart.Chart.pp → pretty_format
Tree.pprint() → pformat
FreqDist.pprint → pformat
Tree.pretty_print → pprint
Tree.pprint_latex_qtree → pformat_latex_qtree
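
The Tree renames are easy to mix up, because pprint() used to return a string and pformat() now takes over that role. A small sketch, assuming NLTK 3.0.2 or later is installed and using a made-up example tree:

```python
from nltk import Tree

tree = Tree.fromstring("(S (NP I) (VP (V saw) (NP him)))")

# Before 3.0.2: text = tree.pprint()
text = tree.pformat()    # returns the bracketed string

# In current NLTK 3 releases, pprint() prints that text to standard output.
tree.pprint()
```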

Environment variables for third-party software:

These have been normalised; please see Installing Third Party Software

More background on Python 3 and NLTK 3:
