# Creating HTMLCorpusReader

### Let's create an HTMLCorpusReader class. In this chapter we start building it; in the next chapter we continue working on it.

In [1]:
# Import the required libraries

from nltk.corpus.reader.api import CorpusReader, CategorizedCorpusReader
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
import codecs
import os

## Toy Example


#### First, let's create a CategorizedPlaintextCorpusReader object. To do this we need two regular-expression templates: one for document names and one for category names. The constructor takes three arguments: the path to the directory that contains the category folders, the document-name template, and the category-name template (passed as cat_pattern).

In [2]:
CAT_PATTERN = r'([\w_\s]+)/.*'
DOC_PATTERN = r'(?!\.)[\w_\s]+/[\w\s\d\-]+\.txt'

corpus = CategorizedPlaintextCorpusReader(
'C:/Users/79771/anaconda3/Scripts/Projects/corpus', DOC_PATTERN, cat_pattern = CAT_PATTERN)

corpus.categories()
# returns ['Star Trek', 'Star Wars']

corpus.fileids()
# returns ['Star Trek/Star Trek - Balance of Terror.txt', 'Star Wars/Star Wars Episode 1.txt']

['Star Trek/Star Trek - Balance of Terror.txt',
 'Star Wars/Star Wars Episode 1.txt']
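
To see what the two templates do on their own, here is a minimal sketch that applies them to a fileid with the standard `re` module. It only approximates how the reader uses the patterns. The `.ipynb_checkpoints/notes.txt` path is a hypothetical example that does not come from the corpus above.

```python
import re

# Patterns copied from the cell above.
CAT_PATTERN = r'([\w_\s]+)/.*'
DOC_PATTERN = r'(?!\.)[\w_\s]+/[\w\s\d\-]+\.txt'

fileid = 'Star Trek/Star Trek - Balance of Terror.txt'

# The single capture group of CAT_PATTERN is the folder name, i.e. the category.
category = re.match(CAT_PATTERN, fileid).group(1)
print(category)

# DOC_PATTERN accepts "<folder>/<name>.txt" ...
doc_ok = re.fullmatch(DOC_PATTERN, fileid) is not None

# ... and the (?!\.) lookahead rejects paths that start with a dot.
# Hypothetical hidden folder, used only for illustration.
hidden_ok = re.fullmatch(DOC_PATTERN, '.ipynb_checkpoints/notes.txt') is not None
print(doc_ok, hidden_ok)
```

The category is simply whatever precedes the first slash, which is why the folder names show up as the result of `corpus.categories()`.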

## Now let's turn to the main task: creating the HTMLCorpusReader class.

### First, let's define the templates for HTML file names and category names.

In [5]:
CATEGORY_PATTERN = r'([\w_\s]+)/.*'
HTML_PATTERN = r'(?!\.)[\w_\s]+/[\w\s\d\-\,\.\«\»\(\)\–]+\.html'
TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li']
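
The character class in HTML_PATTERN decides which file names the reader will pick up, so it is worth checking it against real names before building the corpus. The sketch below is only an approximation with `re.fullmatch`. The first candidate is a fileid from this corpus, while the two `cooking/recipe...` names are hypothetical.

```python
import re

# Pattern copied from the cell above.
HTML_PATTERN = r'(?!\.)[\w_\s]+/[\w\s\d\-\,\.\«\»\(\)\–]+\.html'

candidates = [
    # A real fileid from the corpus: Cyrillic letters count as \w in Python 3.
    'business/Как российская «фабрика троллей» нанимала журналистов из США.html',
    # Hypothetical: "!" is not in the character class, so this is skipped.
    'cooking/recipe!.html',
    # Hypothetical: wrong extension, so this is skipped too.
    'cooking/recipe.htm',
]

matched = [c for c in candidates if re.fullmatch(HTML_PATTERN, c)]
skipped = [c for c in candidates if c not in matched]

print(matched)
print(skipped)
```

A file whose name contains a character outside the class is left out without any warning, so a smaller-than-expected `fileids()` list is usually a sign that the pattern needs another character.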

### Next we create the HTMLCorpusReader class. It has four methods: __init__, resolve, docs, and sizes.

In [6]:
class HTMLCorpusReader(CategorizedCorpusReader, CorpusReader):
    """
    A corpus reader for raw HTML documents to enable preprocessing.
    """

    def __init__(self, root, fileids=HTML_PATTERN, encoding='utf8',
                 tags=TAGS, **kwargs):
        """
        Initialize the corpus reader.  Categorization arguments
        (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
        the ``CategorizedCorpusReader`` constructor.  The remaining
        arguments are passed to the ``CorpusReader`` constructor.
        """
        # Add the default category pattern if not passed into the class.
        if not any(key.startswith('cat_') for key in kwargs.keys()):
            kwargs['cat_pattern'] = CATEGORY_PATTERN

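        # Initialize the NLTK corpus reader objects.
        # CategorizedCorpusReader takes the kwargs dict itself (not **kwargs)
        # and reads the cat_* entries from it.
        CategorizedCorpusReader.__init__(self, kwargs)
        CorpusReader.__init__(self, root, fileids, encoding)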

        # Save the tags that we specifically want to extract.
        self.tags = tags

    def resolve(self, fileids, categories):
        """
        Returns a list of fileids or categories depending on what is passed
        to each internal corpus reader function. Implemented similarly to
        the NLTK ``CategorizedPlaintextCorpusReader``.
        """
        if fileids is not None and categories is not None:
            raise ValueError("Specify fileids or categories, not both")

        if categories is not None:
            return self.fileids(categories)
        return fileids

    def docs(self, fileids=None, categories=None):
        """
        Returns the complete text of an HTML document, closing the document
        after we are done reading it and yielding it in a memory safe fashion.
        """
        # Resolve the fileids and the categories
        fileids = self.resolve(fileids, categories)

        # Create a generator, loading one document into memory at a time.
        for path, encoding in self.abspaths(fileids, include_encoding=True):
            with codecs.open(path, 'r', encoding=encoding) as f:
                yield f.read()

    def sizes(self, fileids=None, categories=None):
        """
        Yields the size on disk (in bytes) of each file, in the order of
        the resolved fileids. This function is used to detect oddly large
        files in the corpus.
        """
        # Resolve the fileids and the categories
        fileids = self.resolve(fileids, categories)

        # Create a generator, getting every path and computing filesize
        for path in self.abspaths(fileids):
            yield os.path.getsize(path)

#### Let's create an instance of the class, passing the path to the corpus directory as the argument.

In [7]:
p = HTMLCorpusReader('C:/Users/79771/anaconda3/Scripts/Projects/htmlcorpus')

#### The categories method returns a list of categories.

In [8]:
p.categories()

['books', 'business', 'cooking', 'sports']

#### The fileids method returns a list of all .html files, regardless of category.

In [9]:
p.fileids()

['books/Deep Learning for Coders with fastai and PyTorch_ AI Applications Without a PhD_ Howard, Jeremy, Gugger, Sylvain_ 9781492045526_ Amazon.com_ Books.html',
 'books/Life of a Klansman_ A Family History in White Supremacy_ Ball, Edward_ 9780374186326_ Amazon.com_ Books.html',
 'books/Natural Language Processing with Python_ Analyzing Text with the Natural Language Toolkit_ Bird, Steven, Klein, Ewan, Loper, Edward_ 0636920516491_ Amazon.com_ Books.html',
 'business/Выбирайте выражения_ хотел ли Дмитрий Медведев раздать деньги всем поровну.html',
 'business/Как российская «фабрика троллей» нанимала журналистов из США.html',
 'business/Худшее впереди_ Deutsche Bank допустил дальнейшее ослабление рубля из-за Навального и выборов в США.html',
 'cooking/Брауни (brownie) пошаговый рецепт с видео и фото – американская кухня_ выпечка и десерты.html',
 'cooking/Сырники из творога пошаговый рецепт с видео и фото – русская кухня_ завтраки.html',
 'cooking/Сырный суп по-французски с курицей рец

#### Called with categories, the resolve method returns a list of all .html files located in the books and sports directories.

In [10]:
p.resolve(fileids=None,categories = ['books','sports'])

['books/Deep Learning for Coders with fastai and PyTorch_ AI Applications Without a PhD_ Howard, Jeremy, Gugger, Sylvain_ 9781492045526_ Amazon.com_ Books.html',
 'books/Life of a Klansman_ A Family History in White Supremacy_ Ball, Edward_ 9780374186326_ Amazon.com_ Books.html',
 'books/Natural Language Processing with Python_ Analyzing Text with the Natural Language Toolkit_ Bird, Steven, Klein, Ewan, Loper, Edward_ 0636920516491_ Amazon.com_ Books.html',
 'sports/24 года назад Андрей Тихонов взломал матч Кубка УЕФА_ встал в ворота на последней минуте и вытащил опасный удар в угол - 11 друзей Зинченко - Блоги - Sports.ru.html',
 'sports/Боец MMA Харитонов в 40 лет дебютирует в профессиональном боксе. Его соперник вырубал Тайсона еще в 2004-м - Панчер - Блоги - Sports.ru.html',
 'sports/В 2013-м Кокорин на месяц заскочил в «Анжи»_ положили 5 млн евро, обещали «Бугатти», перейти уговорил Денисов - Аналитика Глебчика - Блоги - Sports.ru.html']

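#### Called with fileids only, the resolve method returns them unchanged. Here the file name was passed without its category prefix, so it comes back without the category name; resolve does not check that the file exists.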

In [11]:
p.resolve(fileids='Deep Learning for Coders with fastai and PyTorch_ AI Applications Without a PhD_ Howard, Jeremy, Gugger, Sylvain_ 9781492045526_ Amazon.com_ Books.html',categories=None)

'Deep Learning for Coders with fastai and PyTorch_ AI Applications Without a PhD_ Howard, Jeremy, Gugger, Sylvain_ 9781492045526_ Amazon.com_ Books.html'
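
The three possible outcomes of resolve can be tried without a corpus on disk. The sketch below uses `TinyReader`, a hypothetical stand-in that keeps a category-to-fileids dictionary instead of scanning a directory. Only its `resolve` method mirrors the class above, and the short fileids are made up.

```python
class TinyReader:
    """Hypothetical stand-in for HTMLCorpusReader, for illustration only."""

    def __init__(self, index):
        # index maps a category name to the list of its fileids.
        self._index = index

    def fileids(self, categories=None):
        # With no categories, return every fileid in the index.
        cats = self._index if categories is None else categories
        return [f for c in cats for f in self._index[c]]

    def resolve(self, fileids, categories):
        # Same logic as HTMLCorpusReader.resolve above.
        if fileids is not None and categories is not None:
            raise ValueError("Specify fileids or categories, not both")
        if categories is not None:
            return self.fileids(categories)
        return fileids


reader = TinyReader({'books': ['books/a.html'], 'sports': ['sports/b.html']})

by_category = reader.resolve(None, ['sports'])   # looked up via fileids()
passthrough = reader.resolve('a.html', None)     # returned exactly as given

try:
    reader.resolve('a.html', ['books'])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(by_category, passthrough, error_message)
```

Because fileids pass through untouched, a name without its category prefix is accepted here and would only cause trouble later, when a method tries to open the file.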

#### The docs method creates a generator that yields the full text of each HTML document, so we iterate over it with a loop to see the results.

In [12]:
for i in p.docs(fileids=None,categories=['sports']):
    print(i)

 <!doctype html><html class="no-js"><head> <title>24 года назад Андрей Тихонов взломал матч Кубка УЕФА: встал в ворота на последней минуте и вытащил опасный удар в угол - 11 друзей Зинченко - Блоги - Sports.ru</title>  <link rel="preconnect" href="https://st.s5o.ru"> <link rel="preconnect" href="https://star.s5o.ru"> <link rel="preconnect" href="https://s5o.ru"> <link rel="preconnect" href="https://stat.sports.ru"> <link rel="preconnect" href="https://trbna.com"> <link rel="preconnect" href="https://cdn.tribuna.com"> <link rel="dns-prefetch" href="//www.googletagmanager.com"> <link rel="dns-prefetch" href="//www.google.com"> <link rel="dns-prefetch" href="//www.google-analytics.com"> <link rel="dns-prefetch" href="//mc.yandex.ru"> <link rel="dns-prefetch" href="//yastatic.net"> <link rel="dns-prefetch" href="//ls.hit.gemius.pl"> <link rel="dns-prefetch" href="//stats.g.doubleclick.net"> <link rel="dns-prefetch" href="//zen.yandex.ru"> <link rel="dns-prefetch" href="//vk.com">        <l
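
The point of yielding documents from a generator is that only one file is held in memory at a time. The sketch below shows the same pattern outside the class, on two throwaway files. `read_docs` and the file contents are hypothetical and exist only for this illustration.

```python
import codecs
import os
import tempfile


def read_docs(paths, encoding='utf8'):
    """Yield the text of each file in turn, closing it before the next one."""
    for path in paths:
        with codecs.open(path, 'r', encoding=encoding) as f:
            yield f.read()


# Build two tiny throwaway files to read back.
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [('a.html', '<p>first</p>'), ('b.html', '<p>second</p>')]:
    path = os.path.join(tmpdir, name)
    with codecs.open(path, 'w', encoding='utf8') as f:
        f.write(text)
    paths.append(path)

gen = read_docs(paths)
first = next(gen)   # only a.html has been opened and read so far
rest = list(gen)    # drains the remaining documents

print(first)
print(rest)
```

With the real reader, `next(p.docs(categories=['sports']))` would be one way to look at a single document without printing the whole category.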

#### The sizes method creates a generator that yields the size on disk of each file, in bytes, so we iterate over it with a loop as well.

In [17]:
for i in p.sizes(fileids=None,categories=['books']):
    print(i)

632986
623646
645277
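
Bare numbers do not say which file is the large one. One option, sketched below, is to pair each size with its fileid and sort by size. `sizes_with_ids` is a hypothetical helper that is not part of the class, and the two files are created only for the demonstration.

```python
import os
import tempfile


def sizes_with_ids(root, fileids):
    """Yield (fileid, size in bytes) for each fileid under root."""
    for fileid in fileids:
        yield fileid, os.path.getsize(os.path.join(root, fileid))


# Build a throwaway corpus with one small and one large file.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'books'))
contents = {
    'books/small.html': b'x' * 10,
    'books/large.html': b'x' * 5000,
}
for fileid, data in contents.items():
    with open(os.path.join(root, fileid), 'wb') as f:
        f.write(data)

# Largest files first.
pairs = sorted(sizes_with_ids(root, contents), key=lambda p: p[1], reverse=True)
print(pairs)
```

With the reader above, a similar pairing could probably be obtained with `zip(p.fileids(categories=['books']), p.sizes(categories=['books']))`, assuming both calls return the files in the same order.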
