Porting your code to NLTK 3.0
NLTK 3.0 contains a number of interface changes. These are being incorporated into a new version of the NLTK book, updated for Python 3 and NLTK 3.
NLTK also handles unicode differently: NLTK 3 requires all text input to be unicode and always returns text as unicode. Previously, some functions and classes worked on unicode while others required encoded bytestrings. Make sure you pass unicode to NLTK and expect unicode output from it; existing code that assumes bytestrings may start to fail.
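One way to meet the unicode requirement is to decode any bytestrings at the boundary of your program, before they reach NLTK. The sketch below uses only the standard library; `to_text` is a hypothetical helper name, and UTF-8 is an assumed default encoding that you should replace with whatever your data actually uses.

```python
def to_text(data, encoding="utf-8"):
    """Return `data` as a unicode string, decoding it first if it is bytes.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# Bytes read from a file or socket are decoded once, up front.
raw = "na\u00efve caf\u00e9".encode("utf-8")
text = to_text(raw)

# Text that is already unicode passes through unchanged.
same = to_text(text)
```

After this step, `text` can be handed to NLTK tokenizers, taggers and parsers, and their results can be treated as unicode as well.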
Here are some changes you may need to make:
- grammar: ContextFreeGrammar → CFG, WeightedGrammar → PCFG, StatisticalDependencyGrammar → ProbabilisticDependencyGrammar, WeightedProduction → ProbabilisticProduction
- draw.tree: TreeSegmentWidget.node() → TreeSegmentWidget.label(), TreeSegmentWidget.set_node() → TreeSegmentWidget.set_label()
- parsers: nbest_parse() → parse()
- ccg.parse.chart: EdgeI.next() → EdgeI.nextsym()
- Chunk parser: top_node → root_label; chunk_node → chunk_label
- WordNet properties are now access methods, e.g. Synset.definition → Synset.definition()
- sem.relextract: mk_pairs() → _tree2semi_rel(), mk_reldicts() → semi_rel2reldict(), show_clause() → clause(), show_raw_rtuple() → rtuple()
- corpusname.tagged_words(simplify_tags=True) → corpusname.tagged_words(tagset='universal')
- util.clean_html() → BeautifulSoup.get_text(). clean_html() has been dropped; install and use BeautifulSoup or another HTML parser instead.
- util.ibigrams() → util.bigrams()
- util.ingrams() → util.ngrams()
- util.itrigrams() → util.trigrams()
- metrics.windowdiff → metrics.segmentation.windowdiff(); metrics.windowdiff.demo() was removed.
- parse.generate2 was rewritten and merged into parse.generate
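As a rough illustration of the parser and n-gram renames above, the following sketch parses a sentence with parse() and builds bigrams and trigrams. The toy grammar and sentence are made up for this example. It assumes an NLTK 3 installation, where parse() and the n-gram helpers return iterators rather than lists.

```python
from nltk import CFG
from nltk.parse import ChartParser
from nltk.util import bigrams, ngrams

# A toy grammar, invented for this example.
grammar = CFG.fromstring("""
    S -> NP VP
    NP -> 'I' | 'him'
    VP -> V NP
    V -> 'saw'
""")
parser = ChartParser(grammar)

# parse() replaces nbest_parse(); it yields trees lazily,
# so wrap it in list() when all parses are needed at once.
trees = list(parser.parse("I saw him".split()))

# Tree labels are read through label() in NLTK 3.
root = trees[0].label()

# bigrams() and ngrams() replace ibigrams() and ingrams();
# materialise the results if a list is required.
pairs = list(bigrams(["a", "b", "c"]))
triples = list(ngrams(["a", "b", "c"], 3))
```

Code that indexed the old list results directly, such as `nbest_parse(tokens)[0]`, needs an explicit `list(...)` or `next(...)` around the new calls.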
Creating objects from strings:
- Many objects now support a fromstring() method
- tree.Tree.parse() → tree.Tree.fromstring()
- tree.Tree() → tree.Tree.fromstring()
- chunk.RegexpChunkRule.parse() → chunk.RegexpChunkRule.fromstring()
- grammar.parse_cfg() → CFG.fromstring() (same for other types of grammar)
- sem.LogicParser.parse() → sem.Expression.fromstring()
- sem.DrtParser.parse() → sem.DrtExpression.fromstring()
- sem.parse_valuation() → sem.Valuation.fromstring()
- sem.parse_type() → sem.Type.fromstring()
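A minimal sketch of the fromstring() pattern, assuming NLTK 3 is installed. The bracketed tree and the logic formula are invented inputs for illustration.

```python
from nltk.tree import Tree
from nltk.sem import Expression

# Formerly Tree.parse(...) or Tree("(S ...)").
tree = Tree.fromstring("(S (NP I) (VP (V saw) (NP him)))")

# Formerly LogicParser().parse(...).
expr = Expression.fromstring("walk(john)")

label = tree.label()
leaves = tree.leaves()
```

The same classmethod style applies to the other entries in the list, so migrating usually means replacing a parser object or module-level function with a call on the target class.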
Operations on lists of sentences or other items:
- tokenize.batch_tokenize() → tokenize.tokenize_sents()
- tag.batch_tag() → tag.tag_sents()
- parse.batch_parse() → parse.parse_sents()
- classify.batch_classify() → classify.classify_many()
- sem.batch_interpret() → sem.interpret_sents()
- sem.batch_evaluate() → sem.evaluate_sents()
- chunk.batch_ne_chunk() → chunk.ne_chunk_sents()
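The renamed batch methods can be sketched with two components that need no downloaded data: WhitespaceTokenizer and DefaultTagger. Both are chosen here only to keep the example self-contained, and the sentences are made up.

```python
from nltk.tokenize import WhitespaceTokenizer
from nltk.tag import DefaultTagger

sentences = ["the cat sat", "dogs bark"]

# Formerly batch_tokenize(): tokenize several strings in one call.
tokenized = WhitespaceTokenizer().tokenize_sents(sentences)

# Formerly batch_tag(): tag several tokenized sentences in one call.
# DefaultTagger assigns the same tag to every token.
tagged = DefaultTagger("NN").tag_sents(tokenized)
```

Other tokenizers and taggers are expected to expose the same tokenize_sents() and tag_sents() methods, since these are defined on the shared interfaces.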
Changes in probability.FreqDist:
- fdist.keys() → sorted(fdist)
- fdist.inc(x) → fdist[x] += 1
- fdist.samples() → sorted(fdist.keys())
- fdist.Nr(r) → fdist.Nr()[r]
- fdist.Nr_nonzero() → fdist.Nr().items()
- cfdist.conditions() → sorted(cfdist.conditions())
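A short sketch of the counting and ordering replacements, assuming NLTK 3 is installed. The word list is invented, and the example deliberately covers only the increment, sample and condition idioms from the list above.

```python
from nltk.probability import ConditionalFreqDist, FreqDist

words = "the cat sat on the mat".split()

# Formerly fdist.inc(word).
fdist = FreqDist()
for word in words:
    fdist[word] += 1

# Formerly fdist.keys() or fdist.samples(); sort explicitly
# when a stable order is required.
samples = sorted(fdist)

# Conditions are likewise sorted explicitly.
# Here each word is counted under its length as the condition.
cfdist = ConditionalFreqDist((len(word), word) for word in words)
conditions = sorted(cfdist.conditions())
```

Note that sorted(fdist) orders samples alphabetically, not by frequency, so code that relied on a frequency ordering of keys needs a different approach.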
Porter stemmer changes:
- adjust_case(), cons(), cvc(), doublec(), m(), step1ab(), step1c(), step2(), step3(), step4(), step5(), vowelinstem() made private
- ends(), r(), setto() removed
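Since the individual stemming steps are no longer public, calling code would normally go through the public stem() method. A minimal sketch, assuming NLTK 3 is installed; the input words are arbitrary examples.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Use the public entry point instead of calling step1ab(), step2(), etc.
stems = [stemmer.stem(word) for word in ["running", "caresses"]]
```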
Removed modules, classes and functions:
- classify.svm was removed. For classification based on support vector machines (SVMs), use classify.scikitlearn or scikit-learn directly. See https://github.com/nltk/nltk/issues/450.
- probability.GoodTuringProbDist class was removed. See https://github.com/nltk/nltk/issues/381.
- HiddenMarkovModelTaggerTransformI and its subclasses were removed. See https://github.com/nltk/nltk/issues/374.
- classify.maxent no longer supports algorithms backed by scipy.maxentropy. See https://github.com/nltk/nltk/issues/321.
- misc.babelfish was removed. See https://github.com/nltk/nltk/issues/265.
- sourcedstring was removed. See https://github.com/nltk/nltk/issues/322.
- yamltags was removed. JSON is now preferred instead. See https://github.com/nltk/nltk/issues/540
- mallet was removed, including the tag.crf module. See https://github.com/nltk/nltk/issues/104
- tag.simplify was removed. See https://github.com/nltk/nltk/issues/483
- model was removed. See https://github.com/nltk/nltk/issues?labels=model
- corpus.reader.wordnet._lcs_by_depth was removed. See https://github.com/nltk/nltk/issues/422.
Miscellaneous changes:
- probability.ConditionalProbDist.default_factory now inherits from dict instead of defaultdict
- probability.ConditionalProbDistI.default_factory now inherits from dict instead of defaultdict
- probability.DictionaryConditionalProbDist.default_factory now inherits from dict instead of defaultdict
- tag.senna.SennaTagger → classify.Senna
- tag.senna.POSTagger → tag.SennaTagger
- tag.senna.CHKTagger → tag.SennaChunkTagger
Printing changes (from 3.0.2, see https://github.com/nltk/nltk/issues/804):
- classify.decisiontree.DecisionTreeClassifier.pp → pretty_format
- metrics.confusionmatrix.ConfusionMatrix.pp → pretty_format
- sem.lfg.FStructure.pprint → pretty_format
- sem.drt.DrtExpression.pretty → pretty_format
- parse.chart.Chart.pp → pretty_format
- Tree.pprint() → pformat
- FreqDist.pprint → pformat
- Tree.pretty_print → pprint
- Tree.pprint_latex_qtree → pformat_latex_qtree
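To illustrate the renamed formatting methods on Tree, the sketch below builds a small tree and formats it as a string. It assumes NLTK 3.0.2 or later, and the bracketed input is made up for this example.

```python
from nltk.tree import Tree

tree = Tree.fromstring("(S (NP I) (VP (V saw) (NP him)))")

# pformat() returns the bracketed representation as a string
# instead of printing it.
text = tree.pformat()

# pformat_latex_qtree() returns LaTeX markup for the qtree package.
latex = tree.pformat_latex_qtree()
```

Code that previously captured the return value of Tree.pprint() should call pformat() instead, because pprint() now writes to output.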
Environment variables for third-party software:
- These have been normalised; please see Installing Third Party Software
More background on Python 3 and NLTK 3: