Spanish sentence tokenizer does not take in account added abbreviations #2703

thefirebanks · 2021-04-18T05:14:37Z

Hello!

I am trying to split a Spanish text into sentences using the Punkt tokenizer. The code is:

>>> import nltk
>>> es_tokenizer = nltk.data.load("tokenizers/punkt/spanish.pickle")
>>> # Extra abbreviations, written lowercase and without the trailing period
>>> chile_abrevs = {"ord", "num", "sra", "no", "corp"}
>>> es_tokenizer._params.abbrev_types.update(chile_abrevs)
>>> sents = es_tokenizer.tokenize("EL MONUMENTO NATURAL ALERCE COSTERO Y AMPLÍA EL PARQUE NACIONAL ALERCE COSTERO, EN LAS COMUNAS DE CORRAL Y LA UNIÓN, PROVINCIAS DE VALDIVIA Y DEL RANCO, REGIÓN DE LOS RÍOS\n      \n     Núm. 60.- Santiago, 7 de junio de 2013.- Vistos: Estos antecedentes, lo solicitado por la Sra. Ministra de Bienes Nacionales en oficio ord. Nº 916, de 3 de octubre de 2012; lo informado por la Secretaría Regional Ministerial de Los Ríos, mediante oficio ord. Nº 2.957, de 18 de octubre de 2012; lo informado por la Corporación Nacional Forestal (CONAF) y el Ministerio del Medio Ambiente; lo informado por el Asesor Legal de Asuntos Indígenas del Ministerio de Desarrollo Social, mediante oficio ord. Nº 78, de 14 de octubre de 2012; el Acuerdo Nº 26, de 2012, del Consejo de Ministros para la Sustentabilidad; la ley 19.300; el DL Nº 1.939, de 1977;")
>>> print(sents)
['EL MONUMENTO NATURAL ALERCE COSTERO Y AMPLÍA EL PARQUE NACIONAL ALERCE COSTERO, EN LAS COMUNAS DE CORRAL Y LA UNIÓN, PROVINCIAS DE VALDIVIA Y DEL RANCO, REGIÓN DE LOS RÍOS\n      \n     Núm.',
 '60.- Santiago, 7 de junio de 2013.- Vistos: Estos antecedentes, lo solicitado por la Sra.',
 'Ministra de Bienes Nacionales en oficio ord. Nº 916, de 3 de octubre de 2012; lo informado por la Secretaría Regional Ministerial de Los Ríos, mediante oficio ord. Nº 2.957, de 18 de octubre de 2012; lo informado por la Corporación Nacional Forestal (CONAF) y el Ministerio del Medio Ambiente; lo informado por el Asesor Legal de Asuntos Indígenas del Ministerio de Desarrollo Social, mediante oficio ord. Nº 78, de 14 de octubre de 2012; el Acuerdo Nº 26, de 2012, del Consejo de Ministros para la Sustentabilidad; la ley 19.300; el DL Nº 1.939, de 1977;']

As you can see, I am trying to extend the tokenizer's set of known abbreviations to include "num", "ord" and "sra". This works for "ord.", but the text is still split after "Núm." and "Sra.", which shouldn't happen. Any insights on what I should be doing differently?

Thank you in advance!

P.S.: The problem persists even if I remove the accent in "Núm." before attempting the split.
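
One thing that may be worth ruling out is a plain key mismatch. The sketch below is a simplified model, not NLTK's actual code: it assumes the abbreviation check amounts to dropping one trailing period, lowercasing, and testing set membership. The helper name `abbrev_key` is hypothetical and exists only for this illustration. Under that assumption "Núm." becomes "núm", which is a different string from the registered "num".

```python
# Simplified model of the lookup (an assumption, not NLTK's actual code):
# drop one trailing period, lowercase, then test set membership.
def abbrev_key(token: str) -> str:
    """Return the hypothetical lookup key for a period-final token."""
    stem = token[:-1] if token.endswith(".") else token
    return stem.lower()


chile_abrevs = {"ord", "num", "sra", "no", "corp"}

# Which tokens from the sample text would match under this model?
tokens = ["Núm.", "Num.", "Sra.", "ord."]
matches = {tok: abbrev_key(tok) in chile_abrevs for tok in tokens}
for tok, found in matches.items():
    print(f"{tok!r}: key={abbrev_key(tok)!r}, in set={found}")

# Registering the accented spelling as well covers the unmodified text.
chile_abrevs.add("núm")
```

Since the split reportedly persists with the accent removed, and "sra" already matches under this model, a key mismatch is probably not the whole story. Calling `debug_decisions(text)` on the tokenizer may help show which reason is recorded for each break.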


stevenbird added the nltk_data label Apr 20, 2021
