PunktTrainer learns parameters, such as a list of abbreviations, from portions of text without supervision. Using a PunktTrainer directly allows incremental training and lets you modify the hyper-parameters that decide what counts as an abbreviation, a collocation, or a sentence starter.
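
A minimal sketch of the incremental route, assuming the corpus arrives in chunks: each chunk is passed to `train` with `finalize=False`, and `finalize_training` is called once at the end. The sample strings and the hyper-parameter values are placeholders for illustration, not tuned recommendations.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical chunks standing in for a corpus that arrives in pieces
chunks = [
    "Tolerances were amended on Nov. 26, 1997. They were amended again on Sept. 9, 1998.",
    "The rule was revised on Jan. 20, 1999. A further change followed on Mar. 2, 2000.",
]

trainer = PunktTrainer()

# Hyper-parameters are plain attributes on the trainer; set them before training.
trainer.INCLUDE_ALL_COLLOCS = True  # treat every word pair around a period as a candidate collocation
trainer.ABBREV = 0.3                # score cut-off for accepting an abbreviation (illustrative value)

# Feed the chunks one at a time, deferring the final statistics
for chunk in chunks:
    trainer.train(chunk, finalize=False)

# Compute sentence starters and collocations once all chunks have been seen
trainer.finalize_training()

params = trainer.get_params()
tokenizer = PunktSentenceTokenizer(params)
sentences = tokenizer.tokenize(chunks[0])
print(sentences)
```

With so little text the learned parameters will be close to empty; the point is the call sequence, which stays the same for a real corpus.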

In [1]:
import pickle
import nltk
from nltk.tokenize.punkt import PunktTrainer

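# Load your corpus as raw text (the JSON file is not parsed, so its keys and punctuation are part of the training data)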
with open('all_data.json', 'r') as f:
    corpus = f.read()

# Create a new instance of PunktTrainer
trainer = PunktTrainer()

# Train the trainer on your corpus
trainer.train(corpus)

# Serialize the trained model using pickle and save to disk
with open('my_sentence_boundary_detector.punkt', 'wb') as f:
    pickle.dump(trainer, f)
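
Before saving, it can help to look at what was learned. The sketch below trains on a short placeholder string (a stand-in for the real corpus) and reads the learned sets from the `PunktParameters` object returned by `get_params`. It also pickles only the parameters rather than the whole trainer, which is an alternative to the cell above, not a requirement; the in-memory buffer stands in for a file on disk.

```python
import io
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Placeholder corpus; replace with the real training text
sample_corpus = (
    "Tolerances are established for residues. Compliance is determined by measurement. "
    "There are no U.S. registrations for this use. The tolerance expires on the date specified."
)

trainer = PunktTrainer()
trainer.train(sample_corpus)
params = trainer.get_params()

# The learned parameters are ordinary sets and can be inspected directly
print("abbreviations:", sorted(params.abbrev_types))
print("collocations:", sorted(params.collocations))
print("sentence starters:", sorted(params.sent_starters))

# Round-trip only the parameters through pickle
buffer = io.BytesIO()
pickle.dump(params, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

tokenizer = PunktSentenceTokenizer(restored)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.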

In [13]:
# Load the trained model from disk using pickle
with open('my_sentence_boundary_detector.punkt', 'rb') as f:
    trainer = pickle.load(f)

# Create a new instance of PunktSentenceTokenizer using the trained model
tokenizer = nltk.tokenize.PunktSentenceTokenizer(trainer.get_params())

# Tokenize some text (text_1052.txt) using the new tokenizer
text = '''
Title 40: Protection of Environment PART 180-TOLERANCES AND EXEMPTIONS FOR PESTICIDE CHEMICAL RESIDUES IN FOOD Subpart C-Specific Tolerances $180.466 Fenpropathrin; tolerances for residues. (a) General. Tolerances are established for residues of fenpropathrin, including its metabolites and degradates, in or on the commodities in the following table. Compliance with the tolerance levels specified below is to be determined by measuring only fenpropathrin (alpha-cyano-3-phenoxy-benzyl 2,2,3,3 tetramethylcyclopropanecarboxylate). 1There are no U.S. registrations as of November 28, 2012, for the use of fenpropathrin on tea, dried. (b) Section 18 emergency exemptions. Time-limited tolerances specified in Table 2 to this paragraph (b) are established for residues of fenpropathrin, (alpha-cyano-3-phenoxy- benzyl 2,2,3,3 tetramethylcyclopropane carboxylate) in or on the specified agricultural commodities, resulting from use of the pesticide pursuant to FIFRA section 18 emergency exemptions. The tolerance expires on the date specified in Table 2. TABLE 2 TO PARAGRAPH (b) (c) Tolerances with regional registrations. [Reserved] (d) Indirect or inadvertent residues. [Reserved] [62 FR 63034, Nov. 26, 1997, as amended at 63 FR 48116, Sept. 9, 1998; 64 FR 3009, Jan. 20, 1999; 65 FR 11242, Mar. 2, 2000; 65 FR 24397, Apr. 26, 2000; 65 FR 48620, Aug. 9, 2000; 66 FR 64774, Dec. 14, 2001; 67 FR 35049, May 17, 2002; 70 FR 38789, July 6, 2005; 70 FR 55747, Sept. 23, 2005; 74 FR 12606, Mar. 25, 2009; 77 FR 70908, Nov. 28, 2012; 78 FR 69569, Nov. 20, 2013; 84 FR 70434, Dec. 23, 2019]
'''
# Remove all commas (this also changes the content, e.g. "2,2,3,3" becomes "2233")
text = text.replace(",", "")

sentences = tokenizer.tokenize(text)
print(sentences)

['\nTitle 40: Protection of Environment PART 180-TOLERANCES AND EXEMPTIONS FOR PESTICIDE CHEMICAL RESIDUES IN FOOD Subpart C-Specific Tolerances $180.466 Fenpropathrin; tolerances for residues.', '(a) General.', 'Tolerances are established for residues of fenpropathrin including its metabolites and degradates in or on the commodities in the following table.', 'Compliance with the tolerance levels specified below is to be determined by measuring only fenpropathrin (alpha-cyano-3-phenoxy-benzyl 2233 tetramethylcyclopropanecarboxylate).', '1There are no U.S. registrations as of November 28 2012 for the use of fenpropathrin on tea dried.', '(b) Section 18 emergency exemptions.', 'Time-limited tolerances specified in Table 2 to this paragraph (b) are established for residues of fenpropathrin (alpha-cyano-3-phenoxy- benzyl 2233 tetramethylcyclopropane carboxylate) in or on the specified agricultural commodities resulting from use of the pesticide pursuant to FIFRA section 18 emergency exempt
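
If the trained model still breaks after abbreviations it has not seen often enough (month names in the citation list are a likely case), known abbreviations can be added by hand. Punkt stores them lowercased and without the trailing period. The sketch below uses a fresh `PunktParameters` object and a made-up sample sentence so that it runs on its own; with a trained model, the same `update` call could be applied to the object returned by `trainer.get_params()`.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Made-up sample in the style of the citation list
sample = "The rule was amended Nov. 26 1997 by the agency."

# An untrained tokenizer knows no abbreviations
untrained = PunktSentenceTokenizer()
print(untrained.tokenize(sample))

# Abbreviations are stored lowercased, without the final period
params = PunktParameters()
params.abbrev_types.update({"nov", "sept", "jan", "mar", "apr", "aug", "dec"})

with_abbrevs = PunktSentenceTokenizer(params)
print(with_abbrevs.tokenize(sample))
```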

In [14]:
# Join the sentences with a separator line between them
text = "\n-----\n".join(sentences)

# Print the modified text
print(text)


Title 40: Protection of Environment PART 180-TOLERANCES AND EXEMPTIONS FOR PESTICIDE CHEMICAL RESIDUES IN FOOD Subpart C-Specific Tolerances $180.466 Fenpropathrin; tolerances for residues.
-----
(a) General.
-----
Tolerances are established for residues of fenpropathrin including its metabolites and degradates in or on the commodities in the following table.
-----
Compliance with the tolerance levels specified below is to be determined by measuring only fenpropathrin (alpha-cyano-3-phenoxy-benzyl 2233 tetramethylcyclopropanecarboxylate).
-----
1There are no U.S. registrations as of November 28 2012 for the use of fenpropathrin on tea dried.
-----
(b) Section 18 emergency exemptions.
-----
Time-limited tolerances specified in Table 2 to this paragraph (b) are established for residues of fenpropathrin (alpha-cyano-3-phenoxy- benzyl 2233 tetramethylcyclopropane carboxylate) in or on the specified agricultural commodities resulting from use of the pesticide pursuant to FIFRA section 18 e
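
Joining the sentences on the separator itself leaves commas inside a sentence untouched, so the text does not have to be stripped of commas before tokenizing. A small sketch with an untrained `PunktSentenceTokenizer` and a shortened placeholder text (not the trained model from this notebook):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Shortened placeholder text that keeps its commas
sample = (
    "Measure only 2,2,3,3 tetramethylcyclopropanecarboxylate levels. "
    "Tolerances apply to tea, dried. Other uses are reserved."
)

tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(sample)

# Put the separator between sentences without touching their content
joined = "\n-----\n".join(sentences)
print(joined)
```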