This book proposes the following framework for reading raw text and processing it into a format suitable for computation and modeling.  The framework has five steps: content extraction, paragraph blocking, sentence segmentation, word tokenization, and part-of-speech tagging.
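To make the five steps concrete, here is a minimal sketch that uses only the Python standard library.  It is not the book's implementation: the names `BlockExtractor`, `paragraphs`, `sentences`, `tokens`, `tag`, and `TOY_TAGS` are hypothetical, the regex-based splitters are crude stand-ins for NLTK's `sent_tokenize` and `wordpunct_tokenize`, and the lookup table stands in for a real tagger such as `nltk.pos_tag`.

```python
import re
from html.parser import HTMLParser

# Tags treated as paragraph-like blocks (an assumption for this sketch).
BLOCK_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'li'}


class BlockExtractor(HTMLParser):
    """Steps 1-2: extract text from HTML and group it into blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # None while we are outside a block tag

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._buffer is not None:
            # Collapse runs of whitespace and keep non-empty blocks only.
            text = ' '.join(''.join(self._buffer).split())
            if text:
                self.blocks.append(text)
            self._buffer = None


def paragraphs(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def sentences(paragraph):
    # Step 3: naive split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph) if s]


def tokens(sentence):
    # Step 4: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


# Step 5: a toy lookup table standing in for a trained tagger.
TOY_TAGS = {'the': 'DT', 'cat': 'NN', 'sat': 'VBD', '.': '.'}


def tag(token_list):
    return [(t, TOY_TAGS.get(t.lower(), 'UNK')) for t in token_list]


html = "<html><body><h1>Cats</h1><p>The cat sat. It purred!</p></body></html>"

# Result shape: paragraphs -> sentences -> (token, tag) pairs.
processed = [[tag(tokens(s)) for s in sentences(p)] for p in paragraphs(html)]
```

The nested result (a list of paragraphs, each a list of sentences, each a list of `(token, tag)` pairs) mirrors the order of the five steps, so each stage can be swapped for a stronger implementation without changing the others.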

The book explores this framework through a custom-built `HTMLCorpusReader`.  NLTK has a generic `CorpusReader` object to read and stream text data, as well as dozens of specific corpus readers to handle particular text formats, such as `PlaintextCorpusReader`, `TwitterCorpusReader`, and many others.  However, there isn't a built-in class for reading HTML text data (text embedded in HTML-tagged web documents).  So let's build one here, and then use it to explore the basics of processing text into formats suitable for computation.

In [1]:
from nltk.corpus.reader.api import CorpusReader
from nltk.corpus.reader.api import CategorizedCorpusReader

CAT_PATTERN = r'([a-z_\s]+)/.*'
DOC_PATTERN = r'(?!\.)[a-z_\s]+/[a-f0-9]+\.json'
TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li']

class HTMLCorpusReader(CategorizedCorpusReader, CorpusReader):
    """
    A corpus reader for raw HTML documents to enable preprocessing.
    """

    def __init__(self, root, fileids=DOC_PATTERN, encoding='utf8',
                 tags=TAGS, **kwargs):
        """
        Initialize the corpus reader.  Categorization arguments
        (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
        the ``CategorizedCorpusReader`` constructor.  The remaining
        arguments are passed to the ``CorpusReader`` constructor.
        """
        # Add the default category pattern if not passed into the class.
        if not any(key.startswith('cat_') for key in kwargs.keys()):
            kwargs['cat_pattern'] = CAT_PATTERN

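        # Initialize the NLTK corpus reader objects.  Note that
        # CategorizedCorpusReader takes the kwargs dict itself (not
        # **kwargs) and removes the ``cat_*`` keys it consumes.
        CategorizedCorpusReader.__init__(self, kwargs)
        CorpusReader.__init__(self, root, fileids, encoding)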

        # Save the tags that we specifically want to extract.
        self.tags = tags
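It can help to see what the two patterns defined above actually select.  The sketch below applies them to a few made-up file ids (the paths are hypothetical, not taken from any real corpus).  It assumes NLTK matches a `fileids` pattern against the whole relative path and reads the category from the first group of `cat_pattern`; `re.fullmatch` and `re.match` are used here to approximate that behavior.

```python
import re

CAT_PATTERN = r'([a-z_\s]+)/.*'
DOC_PATTERN = r'(?!\.)[a-z_\s]+/[a-f0-9]+\.json'

# Hypothetical relative paths under a corpus root.
fileids = [
    'news/0a1b2c3d.json',
    'data_science/ff00aa.json',
    'README.md',
    '.hidden/abc.json',
]

# Keep only the paths that match the document pattern in full.
matched = [f for f in fileids if re.fullmatch(DOC_PATTERN, f)]

# The first capture group of the category pattern is the directory name.
categories = {f: re.match(CAT_PATTERN, f).group(1) for f in matched}

for fileid, category in categories.items():
    print(fileid, '->', category)
```

Under these assumptions, each document's category is simply the name of the directory it lives in, and the top-level `README.md` and the dot-prefixed directory are skipped.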