It-Chapterize

Chapterize by Jonathan Reeve is a command-line tool that splits English plain-text e-books from Project Gutenberg into chapters, removing both the chapter headings and any text that does not fall between headings.

It-Chapterize is an adaptation of Chapterize for the Italian language with additional minor changes concerning the output.

Main Changes
Installation and Testing
State of the Tool
Tested on

Main Changes

All regular expressions were modified to detect the most likely Italian chapter headings
Chapter headings are included at the beginning of each extracted chapter
The value of the delta variable for removing chapter headings that are likely to be part of a Table of Contents was increased
An additional function removes very short detected chapters, which are likely to be false positives or spurious text
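
The changes listed above can be illustrated with a minimal sketch. It is not the code from itchapterize.py: the pattern `HEADING_RE`, the function names, and the default values of `delta` and `min_words` are all assumptions chosen for illustration, and the pattern covers only numbered headings such as "Capitolo I" or "Parte 2", not spelled-out ordinals such as "Capitolo primo".

```python
import re

# Hypothetical pattern, NOT the tool's actual regex: "Capitolo", "Parte" or
# "Libro" followed by an uppercase Roman numeral or digits, or a bare
# uppercase Roman numeral alone on a line.
HEADING_RE = re.compile(
    r"^\s*(?:(?i:capitolo|parte|libro)\s+(?:[IVXLCDM]+|\d+)|[IVXLCDM]+)\.?\s*$"
)


def find_headings(lines):
    """Return the indices of lines that look like chapter headings."""
    return [i for i, line in enumerate(lines) if HEADING_RE.match(line)]


def drop_toc_headings(indices, delta=8):
    """Drop headings closer than `delta` lines to a neighbouring heading.

    Headings packed this tightly are assumed to be Table of Contents entries.
    The default of 8 is an assumed value, not the one used by the tool.
    """
    too_close = set()
    for first, second in zip(indices, indices[1:]):
        if second - first < delta:
            too_close.update((first, second))
    return [i for i in indices if i not in too_close]


def split_chapters(lines, indices, min_words=50):
    """Split at heading lines, keep each heading, and drop short chapters.

    `min_words` is an assumed threshold for filtering false positives.
    """
    chapters = []
    for start, end in zip(indices, indices[1:] + [len(lines)]):
        text = "\n".join(lines[start:end]).strip()  # heading stays at the top
        if len(text.split()) >= min_words:
            chapters.append(text)
    return chapters
```

Raising `delta` makes the Table of Contents filter more aggressive, at the cost of discarding genuine chapters that are only a few lines long.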

Installation and Testing

# Clone the repository
git clone https://github.com/GiuseppeDellaCorte/It-Chapterize.git

# Grab a copy of "I tre Moschettieri - Volume 1" from Project Gutenberg:
wget https://www.gutenberg.org/files/60641/60641-0.txt

# Run It-Chapterize on it as follows:
python /path-to/itchapterize/itchapterize.py /path-to/60641-0.txt

This creates a new directory named 60641-0.txt-chapters in the current working directory, containing the files 01.txt through 16.txt.
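
To check the result quickly, a short script can list the word count of each extracted chapter file. This is a sketch under assumptions: the helper `chapter_word_counts` is hypothetical and not part of It-Chapterize, and the output files are assumed to be UTF-8 text.

```python
from pathlib import Path


def chapter_word_counts(out_dir):
    """Return (filename, word count) pairs for each .txt file, sorted by name."""
    counts = []
    for path in sorted(Path(out_dir).glob("*.txt")):
        # errors="replace" keeps the check going if a file is not valid UTF-8
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.append((path.name, len(text.split())))
    return counts


# Example usage, with the directory name from the run above:
# for name, words in chapter_word_counts("60641-0.txt-chapters"):
#     print(f"{name}\t{words}")
```

A file with an unusually low count is a good candidate for a false positive or leftover front matter.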

State of the Tool

It-Chapterize has been tested on only a small set of Italian e-books, so it does not yet handle many possible Italian chapter headings.

Tested on

It-Chapterize has been tested successfully on these Italian Project Gutenberg files:

It-Chapterize has also been tested on the Project Gutenberg files that follow this paragraph. It worked relatively well on them, but not perfectly: the output contains one or two false-positive chapters per book. In addition, for a few of them, spurious text is sometimes included, usually in the first or last extracted chapter. Manual correction of false negatives takes around 1-2 minutes per parsed file.
