Porting your code to NLTK 3.0

Santiago Castro edited this page Jul 30, 2015 · 74 revisions

NLTK 3.0 contains a number of interface changes. These are being incorporated into a new version of the NLTK book, updated for Python 3 and NLTK 3.

NLTK's handling of unicode has changed: NLTK 3 requires all text input to be unicode and always returns text as unicode. Previously, some functions and classes worked on unicode while others required encoded bytestrings. Make sure you pass unicode to NLTK and expect unicode output from it; existing code that assumes bytestrings may start to fail.
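
One way to meet the unicode requirement is to decode bytes once, at the point where text enters your program, and pass only unicode strings onward. The sketch below illustrates that idea; `to_text` is a hypothetical helper (not part of NLTK), the UTF-8 default is an assumption about the input encoding, and `str.split()` merely stands in for an NLTK tokenizer call.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a unicode string, decoding bytes if needed.

    Hypothetical helper, not an NLTK function. The UTF-8 default is an
    assumption; use whatever encoding your data was actually written in.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b"caf\xc3\xa9 au lait"   # encoded bytes, e.g. read from a binary file
text = to_text(raw)            # decode once, before any NLTK call
tokens = text.split()          # stand-in for an NLTK tokenizer call
```

Strings that are already unicode pass through unchanged, so the helper can be applied unconditionally at input boundaries.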

Here are some changes you may need to make:

  • grammar: ContextFreeGrammar → CFG, WeightedGrammar → PCFG, StatisticalDependencyGrammar → ProbabilisticDependencyGrammar, WeightedProduction → ProbabilisticProduction
  • draw.tree: TreeSegmentWidget.node() → TreeSegmentWidget.label(), TreeSegmentWidget.set_node() → TreeSegmentWidget.set_label()
  • parsers: nbest_parse() → parse()
  • ccg.parse.chart: EdgeI.next() → EdgeI.nextsym()
  • Chunk parser: top_node → root_label; chunk_node → chunk_label
  • WordNet properties are now access methods, e.g. Synset.definition → Synset.definition()
  • sem.relextract: mk_pairs() → _tree2semi_rel(), mk_reldicts() → semi_rel2reldict(), show_clause() → clause(), show_raw_rtuple() → rtuple()
  • corpusname.tagged_words(simplify_tags=True) → corpusname.tagged_words(tagset='universal')
  • util.clean_html() → BeautifulSoup.get_text(). clean_html() has been dropped; install and use BeautifulSoup or some other HTML parser instead.
  • util.ibigrams() → util.bigrams()
  • util.ingrams() → util.ngrams()
  • util.itrigrams() → util.trigrams()
  • metrics.windowdiff → metrics.segmentation.windowdiff(); metrics.windowdiff.demo() was removed.
  • parse.generate2 was re-written and merged into parse.generate
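
The clean_html() entry above suggests BeautifulSoup "or some other html parser". As a rough illustration of the second option, here is a minimal sketch built on the standard library's `html.parser`. The names `TextExtractor` and `strip_html` are hypothetical, and the output is not guaranteed to match what clean_html() used to produce.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join("".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


plain = strip_html("<p>Hello <b>NLTK</b> 3</p><script>var x = 1;</script>")
```

For messy real-world pages a dedicated library such as BeautifulSoup is likely to be more robust; this sketch only covers the simple case.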

Creating objects from strings:

  • Many objects now support a fromstring() method
  • tree.Tree.parse() → tree.Tree.fromstring()
  • tree.Tree() → tree.Tree.fromstring()
  • chunk.RegexpChunkRule.parse() → chunk.RegexpChunkRule.fromstring()
  • grammar.parse_cfg() → CFG.fromstring() (same for other types of grammar)
  • sem.LogicParser.parse() → sem.Expression.fromstring()
  • sem.DrtParser.parse() → sem.DrtExpression.fromstring()
  • sem.parse_valuation() → sem.Valuation.fromstring()
  • sem.parse_type() → sem.Type.fromstring()

Operations on lists of sentences or other items:

  • tokenize.batch_tokenize() → tokenize.tokenize_sents()
  • tag.batch_tag() → tag.tag_sents()
  • parse.batch_parse() → parse.parse_sents()
  • classify.batch_classify() → classify.classify_many()
  • sem.batch_interpret() → sem.interpret_sents()
  • sem.batch_evaluate() → sem.evaluate_sents()
  • chunk.batch_ne_chunk() → chunk.ne_chunk_sents()
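
If the same code has to run against both the old and the new method names during a migration, one option is to look up the new name first and fall back to the old one. The sketch below shows that pattern; `call_renamed` is a hypothetical helper and `FakeTagger` is a stand-in class used only so the example runs without NLTK installed.

```python
def call_renamed(obj, new_name, old_name, *args, **kwargs):
    """Call obj.<new_name>, falling back to obj.<old_name> on older APIs.

    Hypothetical compatibility helper, not part of NLTK.
    """
    method = getattr(obj, new_name, None)
    if method is None:
        method = getattr(obj, old_name)
    return method(*args, **kwargs)


class FakeTagger:
    """Stand-in exposing only the NLTK 3 style name `tag_sents`."""

    def tag_sents(self, sentences):
        # Tag every word with a dummy label, one list per sentence.
        return [[(word, "X") for word in sent] for sent in sentences]


tagged = call_renamed(FakeTagger(), "tag_sents", "batch_tag", [["a", "b"], ["c"]])
```

Once the port is complete, calling the new name directly is simpler and the helper can be removed.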

Changes in probability.FreqDist:

  • fdist.keys() → sorted(fdist)
  • fdist.inc(x) → fdist[x] += 1
  • fdist.samples() → sorted(fdist.keys())
  • fdist.Nr(r) → fdist.Nr()[r]
  • fdist.Nr_nonzero() → fdist.Nr().items()
  • cfdist.conditions() → sorted(cfdist.conditions())
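
NLTK 3's FreqDist is built on `collections.Counter`, so the counting and sorting replacements above can be tried out with a plain Counter. The sketch below uses Counter as a stand-in so that it runs without NLTK; the `count_of_counts` mapping is a hand-built illustration of the "number of samples seen r times" idea, not a call to the NLTK method.

```python
from collections import Counter

fdist = Counter()                      # stand-in for nltk.FreqDist
for word in ["the", "cat", "the", "mat"]:
    fdist[word] += 1                   # replaces fdist.inc(word)

samples = sorted(fdist)                # samples ordered by their own value
by_frequency = fdist.most_common()     # (sample, count) pairs, most frequent first

# Hand-built count-of-counts: how many samples occur exactly r times.
count_of_counts = Counter(fdist.values())
```

Note that sorted(fdist) orders samples alphabetically rather than by count; code that relied on a frequency ordering will probably want most_common() instead.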

Porter stemmer changes:

  • adjust_case(), cons(), cvc(), doublec(), m(), step1ab(), step1c(), step2(), step3(), step4(), step5(), vowelinstem() made private
  • ends(), r(), setto() removed

Removed modules, classes and functions:

Miscellaneous changes:

  • probability.ConditionalProbDist.default_factory now inherits from dict instead of defaultdict
  • probability.ConditionalProbDistI.default_factory now inherits from dict instead of defaultdict
  • probability.DictionaryConditionalProbDist.default_factory now inherits from dict instead of defaultdict
  • tag.senna.SennaTagger → classify.Senna
  • tag.senna.POSTagger → tag.SennaTagger
  • tag.senna.CHKTagger → tag.SennaChunkTagger

Printing changes (from 3.0.2, see https://github.com/nltk/nltk/issues/804):

  • classify.decisiontree.DecisionTreeClassifier.pp → pretty_format
  • metrics.confusionmatrix.ConfusionMatrix.pp → pretty_format
  • sem.lfg.FStructure.pprint → pretty_format
  • sem.drt.DrtExpression.pretty → pretty_format
  • parse.chart.Chart.pp → pretty_format
  • Tree.pprint() → pformat
  • FreqDist.pprint → pformat
  • Tree.pretty_print → pprint
  • Tree.pprint_latex_qtree → pformat_latex_qtree

Environment variables for third-party software:

More background on Python 3 and NLTK 3: