A tool for extracting chapters from Italian plain-text e-books from Project Gutenberg. Regular expressions are used to match chapter headings and extract the text between them.

Giuseppe-Della-Corte/It-Chapterize

It-Chapterize

Chapterize by Jonathan Reeve is a command-line tool that breaks up English plain-text e-books from Project Gutenberg into chapters, removing both the chapter headings and any text that does not fall between headings.

It-Chapterize is an adaptation of Chapterize for the Italian language, with a few additional minor changes to the output.

Main Changes

  • All regular expressions were modified to detect the most likely Italian chapter headings
  • Chapter headings are included at the beginning of each extracted chapter
  • The value of the delta variable, used to remove chapter headings that are likely to be part of a Table of Contents, was increased
  • An additional function removes short detected chapters, which are likely to be false positives or spurious text
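The changes above can be illustrated with a minimal sketch. This is not the code in itchapterize.py: the pattern `HEADING_RE`, the function names, and the default values of `delta` and `min_words` are all assumptions chosen for illustration, and the real tool's regular expressions cover more heading styles.

```python
import re

# Hypothetical pattern: matches lines such as "CAPITOLO I.", "Capitolo 12"
# or "Cap. IV". The patterns used by the actual tool may differ.
HEADING_RE = re.compile(
    r"^\s*(?:capitolo|cap\.)\s+(?:[ivxlcdm]+|\d+)\.?\s*$",
    re.IGNORECASE,
)


def find_headings(lines):
    """Return the indices of lines that look like chapter headings."""
    return [i for i, line in enumerate(lines) if HEADING_RE.match(line)]


def drop_toc_headings(locations, delta=8):
    """Drop headings closer than `delta` lines to a neighbouring heading.

    Headings packed tightly together are assumed to belong to a
    Table of Contents. The default of 8 is an arbitrary example value.
    """
    close = set()
    for a, b in zip(locations, locations[1:]):
        if b - a < delta:
            close.update((a, b))
    return [loc for loc in locations if loc not in close]


def split_chapters(lines, locations):
    """Slice the text into chapters, each one starting with its heading."""
    bounds = list(locations) + [len(lines)]
    return ["\n".join(lines[a:b]).strip() for a, b in zip(bounds, bounds[1:])]


def drop_short_chapters(chapters, min_words=50):
    """Discard chapters too short to be real (example threshold)."""
    return [c for c in chapters if len(c.split()) >= min_words]
```

A larger `delta` removes more Table of Contents entries, but it can also discard genuine chapters that are only a few lines long, so the value is a trade-off.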

Installation and Testing

# Clone the repository
git clone https://github.com/GiuseppeDellaCorte/It-Chapterize.git

# Grab a copy of "I tre Moschettieri - Volume 1" from Project Gutenberg:
wget https://www.gutenberg.org/files/60641/60641-0.txt

# Run It-Chapterize on it as follows:
python /path-to/itchapterize/itchapterize.py /path-to/60641-0.txt

This creates a new directory named 60641-0.txt-chapters in the current working directory, containing files numbered from 01.txt to 16.txt.
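The output layout described above (a `<input file name>-chapters` directory with zero-padded file names) could be produced along these lines. This is a sketch under assumptions: `write_chapters` and its parameters are hypothetical names, not functions taken from itchapterize.py.

```python
from pathlib import Path


def write_chapters(chapters, source_path, out_root="."):
    """Write each chapter to <out_root>/<source file name>-chapters/NN.txt."""
    out_dir = Path(out_root) / (Path(source_path).name + "-chapters")
    out_dir.mkdir(parents=True, exist_ok=True)
    # Pad to at least two digits so the files sort correctly: 01.txt, 02.txt, ...
    width = max(2, len(str(len(chapters))))
    paths = []
    for number, text in enumerate(chapters, start=1):
        path = out_dir / f"{number:0{width}d}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

Zero-padding keeps the chapter files in reading order when they are listed alphabetically.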

State of the Tool

It-Chapterize has been tested on only a small set of Italian e-books, so it is likely to miss many of the chapter heading styles found in Italian texts.

Tested on

It-Chapterize has been tested successfully on these Italian Project Gutenberg files:

It-Chapterize has also been tested on the Project Gutenberg files that follow this paragraph. It worked relatively well on them, but not perfectly: the output text files include one or two false-positive chapters. In addition, for a few of them, spurious text is sometimes included, usually in the first or last extracted chapter. Manual correction of false negatives takes around 1 to 2 minutes per parsed file.
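When correcting the output by hand, one way to spot suspicious chapters quickly is to compare word counts across the extracted files, since false positives and spurious text tend to be much shorter or longer than real chapters. The helper below is a hypothetical sketch and is not part of It-Chapterize.

```python
from pathlib import Path


def chapter_word_counts(chapter_dir):
    """Return (file name, word count) pairs for every .txt file, sorted by name."""
    return [
        (path.name, len(path.read_text(encoding="utf-8").split()))
        for path in sorted(Path(chapter_dir).glob("*.txt"))
    ]


if __name__ == "__main__":
    # Example directory name taken from the installation test above.
    for name, count in chapter_word_counts("60641-0.txt-chapters"):
        print(f"{name}\t{count}")
```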
