Practice using CategorizedPlaintextCorpusReader to read text files.

NLTK has many CorpusReader classes.  In general, we pass in the path to the root of the corpus, a signature that identifies the files (usually a regular expression), and possibly the file type.

We need to make sure the signatures capture the files of interest while ignoring extraneous files that we do not want to analyze.

In [2]:
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader

DOC_PATTERN = r'(?!\.)[\w_\s]+/[\w\s\d\-]+\.txt'
CAT_PATTERN = r'([\w_\s]+)/.*'

corpus = CategorizedPlaintextCorpusReader(
    'Star_Wars_Corpus', DOC_PATTERN, cat_pattern=CAT_PATTERN
)
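
Before pointing a reader at a real directory, it can help to try the patterns on a few paths by hand. The sketch below is only an illustration: the candidate paths are made up, and it assumes the patterns are applied with `re.match` from the start of each path relative to the corpus root, which is our understanding of how NLTK filters file IDs.

```python
import re

DOC_PATTERN = r'(?!\.)[\w_\s]+/[\w\s\d\-]+\.txt'
CAT_PATTERN = r'([\w_\s]+)/.*'

# Made-up relative paths: two we want, three we do not.
candidates = [
    'Star Wars/Star Wars Episode 1.txt',
    'Star Trek/Star Trek - Balance of Terror.txt',
    'Star Wars/notes.md',        # wrong extension
    'Star Wars/.DS_Store',       # hidden file
    'README.txt',                # not inside a category folder
]

# Keep only the paths the document pattern accepts.
matched = [p for p in candidates if re.match(DOC_PATTERN, p)]

# The first capture group of the category pattern is the category name.
categories = {p: re.match(CAT_PATTERN, p).group(1) for p in matched}

for path in matched:
    print(categories[path], '->', path)
```

Only the two `.txt` files inside a category folder get through, and the folder name becomes the category.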

In [3]:
# Now see which categories were captured
corpus.categories()

['Star Trek', 'Star Wars']

In [5]:
# And see which file IDs were captured
corpus.fileids()

['Star Trek/Star Trek - Balance of Terror 2.txt',
 'Star Trek/Star Trek - Balance of Terror.txt',
 'Star Wars/Star Wars Episode 1.txt',
 'Star Wars/Star Wars Episode 2.txt',
 'Star Wars/other_info.txt']
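
Once files are grouped into categories, the reader's accessors can be restricted to one category at a time. The following is a minimal sketch rather than part of the original notebook: it builds a tiny throwaway corpus in a temporary directory (the file names and text are invented) so that it runs without the real data, and it assumes NLTK is installed.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader

DOC_PATTERN = r'(?!\.)[\w_\s]+/[\w\s\d\-]+\.txt'
CAT_PATTERN = r'([\w_\s]+)/.*'

# Build a small throwaway corpus; names and contents are made up.
root = tempfile.mkdtemp()
samples = {
    'Star Wars/episode 1.txt': 'A long time ago in a galaxy far, far away.',
    'Star Trek/pilot.txt': 'Space: the final frontier.',
    'Star Trek/notes.md': 'Not a .txt file, so it should be skipped.',
}
for relpath, text in samples.items():
    path = os.path.join(root, *relpath.split('/'))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w', encoding='utf8') as f:
        f.write(text)

demo = CategorizedPlaintextCorpusReader(
    root, DOC_PATTERN, cat_pattern=CAT_PATTERN
)

# Restrict each accessor to a single category.
trek_ids = demo.fileids(categories='Star Trek')
trek_words = list(demo.words(categories='Star Trek'))
wars_raw = demo.raw(categories='Star Wars')

print(trek_ids)
print(trek_words)
print(wars_raw)
```

The `notes.md` file never appears in `trek_ids`, because the document pattern only admits `.txt` files.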

NLTK does not ship an HTML corpus reader, but we can create one by subclassing CorpusReader and CategorizedCorpusReader.
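
Part of what such a reader eventually has to do is pull text out of a chosen set of HTML tags. As a warm-up, here is a minimal sketch of that idea using only the standard library's `html.parser`. The class name `TagTextExtractor` and the sample markup are hypothetical, and this is not how the reader defined below works internally; a real pipeline would more likely rely on a dedicated HTML library.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Hypothetical helper: collect the text found inside selected tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0     # number of tracked tags currently open
        self.chunks = []   # text pieces of the element being read
        self.texts = []    # one entry per outermost tracked element

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and normalize whitespace.
                text = ' '.join(''.join(self.chunks).split())
                if text:
                    self.texts.append(text)
                self.chunks = []

    def handle_data(self, data):
        # Keep text only while inside a tracked element.
        if self.depth > 0:
            self.chunks.append(data)


# Made-up sample document.
html = (
    "<html><body><h1>Title</h1><div>skip me</div>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<ul><li>Item</li></ul></body></html>"
)

extractor = TagTextExtractor(tags=['h1', 'p', 'li'])
extractor.feed(html)
extractor.close()
print(extractor.texts)
```

Text inside untracked tags such as `div` is dropped, while inline markup nested in a tracked tag (the `b` inside `p`) is kept as part of that element's text. The sketch assumes well-formed markup in which every tracked tag is closed.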

In [6]:
from nltk.corpus.reader.api import CorpusReader
from nltk.corpus.reader.api import CategorizedCorpusReader

CAT_PATTERN = r'([a-z_\s]+)/.*'
DOC_PATTERN = r'(?!\.)[a-z_\s]+/[a-f0-9]+\.json'
# HTML defines heading levels h1 through h6 only.
TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'li']

class HTMLCorpusReader(CategorizedCorpusReader, CorpusReader):
    """
    A corpus reader for raw HTML documents to enable preprocessing.
    """

    def __init__(self, root, fileids=DOC_PATTERN, encoding='utf8',
                 tags=TAGS, **kwargs):
        """
        Initialize the corpus reader.  Categorization arguments
        (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
        the ``CategorizedCorpusReader`` constructor.  The remaining
        arguments are passed to the ``CorpusReader`` constructor.
        """
        # Add the default category pattern if not passed into the class.
        if not any(key.startswith('cat_') for key in kwargs.keys()):
            kwargs['cat_pattern'] = CAT_PATTERN

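        # Initialize the NLTK corpus reader objects.  Note that
        # CategorizedCorpusReader takes the kwargs dict itself (not **kwargs)
        # and removes the cat_* entries it consumes.
        CategorizedCorpusReader.__init__(self, kwargs)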
        CorpusReader.__init__(self, root, fileids, encoding)

        # Save the tags that we specifically want to extract.
        self.tags = tags