### Creating a categorized text corpus  
Categorized corpora group their files into named categories; the brown corpus is one example.  
The movie_reviews corpus reader is an instance of
CategorizedPlaintextCorpusReader, as is the reuters corpus reader. But where the
movie_reviews corpus has only two categories (neg and pos), reuters has 90.
These corpora are often used for training and evaluating classifiers, which are covered in
Chapter 7, Text Classification.

<img src="catgorizedTextProcessing.png" />

In [2]:
from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [4]:
"""CategorizedPlaintextCorpusReader inherits from both PlaintextCorpusReader
and CategorizedCorpusReader. Together, these two superclasses need three
arguments: the root directory, the fileids argument, and a category
specification."""
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
# General form:
# CategorizedPlaintextCorpusReader('root directory', r'fileid_regex', cat_pattern=r'category_regex')
reader = CategorizedPlaintextCorpusReader('/Users/alessandropiccolo/nltk_data/corpora/cookbook',
                                          r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')


reader.categories()

['neg', 'pos']
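How cat_pattern turns file names into categories can be pictured with plain `re`. This is only a minimal sketch, assuming the category is the first capture group of the pattern matched against each file ID. The `fileids` list and the `file_to_category` dict are hypothetical names for illustration, not part of NLTK.

```python
import re

# Hypothetical file IDs, mirroring the two files used in this notebook.
fileids = ['movie_neg.txt', 'movie_pos.txt']
cat_pattern = r'movie_(\w+)\.txt'

# Assumption: the first capture group of the pattern is the category name.
file_to_category = {}
for fileid in fileids:
    match = re.match(cat_pattern, fileid)
    if match:
        file_to_category[fileid] = match.group(1)

# The set of distinct captured values plays the role of reader.categories().
categories = sorted(set(file_to_category.values()))
print(file_to_category)
print(categories)
```

A file whose name does not match the pattern is simply skipped in this sketch; how a real reader treats such files is not shown here.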

In [7]:
"""Alternative to cat_pattern: cat_map maps each file ID to a list of categories."""
reader = CategorizedPlaintextCorpusReader('/Users/alessandropiccolo/nltk_data/corpora/cookbook',
                                          r'movie_.*\.txt',
                                          cat_map={'movie_pos.txt': ['pos'], 'movie_neg.txt': ['neg']})
reader.categories()

['neg', 'pos']

In [8]:
%whos

Variable                           Type                                Data/Info
--------------------------------------------------------------------------------
CategorizedPlaintextCorpusReader   type                                <class 'nltk.corpus.reade<...>edPlaintextCorpusReader'>
brown                              CategorizedTaggedCorpusReader       <CategorizedTaggedCorpusR<...>nltk_data/corpora/brown'>
reader                             CategorizedPlaintextCorpusReader    <CategorizedPlaintextCorp<...>k_data/corpora/cookbook'>


In [5]:
"""The cat_pattern keyword is passed to CategorizedCorpusReader, which overrides
common corpus reader functions such as fileids(), words(), sents(), and paras() to
accept a categories keyword argument. This way, you can get all the pos sentences by
calling reader.sents(categories=['pos']). The CategorizedCorpusReader class
also provides the categories() function, which returns a list of all the known
categories in the corpus."""
reader.fileids(categories=['neg'])


['movie_neg.txt']

In [6]:
reader.fileids(categories=['pos'])

['movie_pos.txt']
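The category lookup seen in the cells above can be sketched as an inverted index built from a cat_map-style dict. This is a rough illustration only: `list_categories` and `files_for` are hypothetical helper names, and the real reader may organize its mapping differently.

```python
from collections import defaultdict

# Same shape as the cat_map argument: file ID -> list of categories.
cat_map = {'movie_pos.txt': ['pos'], 'movie_neg.txt': ['neg']}

# Invert it so that each category points at its files.
category_to_files = defaultdict(list)
for fileid, cats in cat_map.items():
    for cat in cats:
        category_to_files[cat].append(fileid)


def list_categories():
    """Hypothetical counterpart of reader.categories()."""
    return sorted(category_to_files)


def files_for(categories):
    """Hypothetical counterpart of reader.fileids(categories=[...])."""
    found = set()
    for cat in categories:
        found.update(category_to_files.get(cat, []))
    return sorted(found)


print(list_categories())
print(files_for(['neg']))
print(files_for(['neg', 'pos']))
```

Because a file may list several categories in cat_map, the index is many-to-many, which is why `files_for` collects results in a set before sorting.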

### Creating a categorized chunk corpus reader (page 69)

Create a class called CategorizedChunkedCorpusReader that inherits from both
CategorizedCorpusReader and ChunkedCorpusReader. It is heavily based on the
CategorizedTaggedCorpusReader class, and also provides three additional methods
for getting categorized chunks: chunked_words(), chunked_sents(), and chunked_paras().
The following code is found in catchunked.py:

> The CategorizedChunkedCorpusReader class overrides all the ChunkedCorpusReader
methods to take a categories argument for locating fileids. These fileids are
found with the internal _resolve() function. This _resolve() function makes use
of CategorizedCorpusReader.fileids() to return fileids for a given list of
categories. If no categories are given, _resolve() just returns the given fileids,
which could be None, in which case all the files are read. Passing both fileids
and categories raises a ValueError.

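The behaviour of _resolve() described above can be tried out on a toy class before reading the full reader. This is a sketch under stated assumptions: `ToyCategorizedReader` is a hypothetical stand-in, not an NLTK class, and only the `_resolve()` body follows the logic used in catchunked.py.

```python
class ToyCategorizedReader:
    """Hypothetical stand-in for a categorized reader (not an NLTK class)."""

    def __init__(self, category_to_files):
        self._category_to_files = category_to_files

    def fileids(self, categories=None):
        # With no categories, return every known file ID.
        if categories is None:
            categories = self._category_to_files
        return sorted({f for c in categories
                       for f in self._category_to_files.get(c, [])})

    def _resolve(self, fileids, categories):
        # Same three-way logic as in catchunked.py.
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        return fileids


toy = ToyCategorizedReader({'neg': ['movie_neg.txt'], 'pos': ['movie_pos.txt']})

by_category = toy._resolve(None, ['pos'])            # categories -> file IDs
passthrough = toy._resolve(['movie_neg.txt'], None)  # file IDs returned unchanged
everything = toy._resolve(None, None)                # None means "read all files"

try:
    toy._resolve(['movie_neg.txt'], ['pos'])
    both_rejected = False
except ValueError:
    both_rejected = True

print(by_category, passthrough, everything, both_rejected)
```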
In [10]:
%pwd

'/Users/alessandropiccolo/Google Drive/Python/1JupyterNotebook/NLTK'

In [29]:
%%writefile catchunked.py
from nltk.corpus.reader import CategorizedCorpusReader, ChunkedCorpusReader
from nltk.corpus.reader import ConllCorpusReader, ConllChunkCorpusReader

class CategorizedChunkedCorpusReader(CategorizedCorpusReader, ChunkedCorpusReader):
	"""
	A reader for chunked corpora whose documents are divided into categories
	based on their file identifiers.
	"""
	# code adapted from CategorizedTaggedCorpusReader
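	def __init__(self, *args, **kwargs):
		# kwargs is passed as a plain dict (not unpacked) on purpose:
		# CategorizedCorpusReader removes the cat_* keys from it, so they
		# are not forwarded to ChunkedCorpusReader.
		CategorizedCorpusReader.__init__(self, kwargs)
		ChunkedCorpusReader.__init__(self, *args, **kwargs)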
	
	def _resolve(self, fileids, categories):
		if fileids is not None and categories is not None:
			raise ValueError('Specify fileids or categories, not both')
		if categories is not None:
			return self.fileids(categories)
		else:
			return fileids
	
	def raw(self, fileids=None, categories=None):
		return ChunkedCorpusReader.raw(self, self._resolve(fileids, categories))
	
	def words(self, fileids=None, categories=None):
		return ChunkedCorpusReader.words(self, self._resolve(fileids, categories))
	
	def sents(self, fileids=None, categories=None):
		return ChunkedCorpusReader.sents(self, self._resolve(fileids, categories))
	
	def paras(self, fileids=None, categories=None):
		return ChunkedCorpusReader.paras(self, self._resolve(fileids, categories))
	
	def tagged_words(self, fileids=None, categories=None):
		return ChunkedCorpusReader.tagged_words(self, self._resolve(fileids, categories))
	
	def tagged_sents(self, fileids=None, categories=None):
		return ChunkedCorpusReader.tagged_sents(self, self._resolve(fileids, categories))
		
	def tagged_paras(self, fileids=None, categories=None):
		return ChunkedCorpusReader.tagged_paras(self, self._resolve(fileids, categories))
	
	def chunked_words(self, fileids=None, categories=None):
		return ChunkedCorpusReader.chunked_words(
			self, self._resolve(fileids, categories))
	
	def chunked_sents(self, fileids=None, categories=None):
		return ChunkedCorpusReader.chunked_sents(
			self, self._resolve(fileids, categories))
	
	def chunked_paras(self, fileids=None, categories=None):
		return ChunkedCorpusReader.chunked_paras(
			self, self._resolve(fileids, categories))


class CategorizedConllChunkCorpusReader(CategorizedCorpusReader, ConllChunkCorpusReader):
	"""
	A reader for conll chunked corpora whose documents are divided into
	categories based on their file identifiers.
	"""
	def __init__(self, *args, **kwargs):
		# NOTE: in addition to cat_pattern, ConllChunkCorpusReader also requires
		# chunk_types as third argument, which defaults to ('NP','VP','PP')
		CategorizedCorpusReader.__init__(self, kwargs)
		ConllChunkCorpusReader.__init__(self, *args, **kwargs)
	
	def _resolve(self, fileids, categories):
		if fileids is not None and categories is not None:
			raise ValueError('Specify fileids or categories, not both')
		if categories is not None:
			return self.fileids(categories)
		else:
			return fileids
	
	def raw(self, fileids=None, categories=None):
		return ConllCorpusReader.raw(self, self._resolve(fileids, categories))
	
	def words(self, fileids=None, categories=None):
		return ConllCorpusReader.words(self, self._resolve(fileids, categories))
	
	def sents(self, fileids=None, categories=None):
		return ConllCorpusReader.sents(self, self._resolve(fileids, categories))
	
	def tagged_words(self, fileids=None, categories=None):
		return ConllCorpusReader.tagged_words(self, self._resolve(fileids, categories))
	
	def tagged_sents(self, fileids=None, categories=None):
		return ConllCorpusReader.tagged_sents(self, self._resolve(fileids, categories))
	
	def chunked_words(self, fileids=None, categories=None, chunk_types=None):
		return ConllCorpusReader.chunked_words(
			self, self._resolve(fileids, categories), chunk_types)
	
	def chunked_sents(self, fileids=None, categories=None, chunk_types=None):
		return ConllCorpusReader.chunked_sents(
			self, self._resolve(fileids, categories), chunk_types)
	
	def parsed_sents(self, fileids=None, categories=None, pos_in_tree=None):
		return ConllCorpusReader.parsed_sents(
			self, self._resolve(fileids, categories), pos_in_tree)
	
	def srl_spans(self, fileids=None, categories=None):
		return ConllCorpusReader.srl_spans(self, self._resolve(fileids, categories))
	
	def srl_instances(self, fileids=None, categories=None, pos_in_tree=None, flatten=True):
		return ConllCorpusReader.srl_instances(
			self, self._resolve(fileids, categories), pos_in_tree, flatten)
	
	def iob_words(self, fileids=None, categories=None):
		return ConllCorpusReader.iob_words(self, self._resolve(fileids, categories))
	
	def iob_sents(self, fileids=None, categories=None):
		return ConllCorpusReader.iob_sents(self, self._resolve(fileids, categories))

Writing catchunked.py


### Making categories out of the fileids argument

In [31]:
import nltk.data
from catchunked import CategorizedChunkedCorpusReader
path = nltk.data.find('corpora/treebank/tagged')
"""Uses the file names to assign categories while chunking."""
reader = CategorizedChunkedCorpusReader(path, r'wsj_.*\.pos', cat_pattern=r'wsj_(.*)\.pos')
len(reader.categories()) == len(reader.fileids())


True

In [32]:
len(reader.chunked_sents(categories=['0001']))

16
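Why the number of categories equals the number of files here can be checked with a small stdlib sketch: the pattern captures the whole numeric part of each name, so every file gets its own category. The file IDs below are a hypothetical subset for illustration, and the coarser pattern `wsj_(\d\d)` is an assumed variation, not something used in this notebook.

```python
import re
from collections import defaultdict

# Hypothetical subset of treebank-style file IDs (illustration only).
fileids = ['wsj_0001.pos', 'wsj_0002.pos', 'wsj_0101.pos', 'wsj_0102.pos']


def group_by_pattern(fileids, cat_pattern):
    """Group file IDs by the first capture group of cat_pattern."""
    groups = defaultdict(list)
    for fileid in fileids:
        match = re.match(cat_pattern, fileid)
        if match:
            groups[match.group(1)].append(fileid)
    return dict(groups)


# Pattern from the cell above: one category per file.
per_file = group_by_pattern(fileids, r'wsj_(.*)\.pos')
# Assumed coarser pattern: group files by their first two digits.
coarse = group_by_pattern(fileids, r'wsj_(\d\d)')

print(len(per_file) == len(fileids))
print(coarse)
```

The choice of capture group is what decides the granularity: a broad group yields one category per file, while a narrow one pools several files under a shared prefix.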

### Chunk corpus using IOB tags (page 72)

In [34]:
import nltk.data
from catchunked import CategorizedConllChunkCorpusReader
path = nltk.data.find('corpora/conll2000')
reader = CategorizedConllChunkCorpusReader(path, r'.*\.txt', ('NP','VP','PP'), cat_pattern=r'(.*)\.txt')
reader.categories()

['test', 'train']

In [35]:
reader.fileids()

['test.txt', 'train.txt']

In [36]:
len(reader.chunked_sents(categories=['test']))

2012
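For readers new to IOB tags, the following stdlib sketch groups IOB-tagged lines into chunks. It assumes one token per line in the form `word POS chunk-tag`, where `B-` starts a chunk, `I-` continues it, and `O` marks a token outside any chunk. The sample text and the `iob_to_chunks` helper are illustrative assumptions, and the sketch drops tokens outside chunks, which a real reader would keep.

```python
# Illustrative sample in "word POS chunk-tag" form (assumed layout).
sample = """\
Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP
is VBZ B-VP
widely RB I-VP
expected VBN I-VP
. . O
"""


def iob_to_chunks(text, chunk_types=('NP', 'VP', 'PP')):
    """Return (chunk_type, words) pairs; tokens outside chunks are dropped."""
    chunks = []
    current_type, current_words = None, []

    def flush():
        # Close the chunk being built, if any, and start afresh.
        nonlocal current_type, current_words
        if current_type is not None:
            chunks.append((current_type, current_words))
        current_type, current_words = None, []

    for line in text.splitlines():
        if not line.strip():          # blank line: sentence boundary
            flush()
            continue
        word, pos, tag = line.split()
        prefix, _, chunk_type = tag.partition('-')
        if prefix == 'O' or chunk_type not in chunk_types:
            flush()
        elif prefix == 'B' or chunk_type != current_type:
            flush()
            current_type, current_words = chunk_type, [word]
        else:                         # I- tag continuing the current chunk
            current_words.append(word)
    flush()
    return chunks


chunks = iob_to_chunks(sample)
for chunk_type, words in chunks:
    print(chunk_type, words)
```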

### Lazy corpus loading

Loading a corpus reader can be an expensive operation due to the number of files, file sizes, and various initialization tasks. ***Deferring that work eliminates the overhead of loading the corpus reader immediately.***

NLTK provides the ***LazyCorpusLoader*** class, which can transform itself into your actual corpus reader as soon as you need it.

In [37]:
"""LazyCorpusLoader is a proxy object which is used to stand in for a
    corpus object before the corpus is loaded.  This allows NLTK to
    create an object for each corpus, but defer the costs associated
    with loading those corpora until the first time that they're
    actually accessed."""
from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import WordListCorpusReader
# Arguments: corpus directory name (under nltk_data/corpora), the reader class,
# then the arguments for that reader (here, the list of file IDs)
reader = LazyCorpusLoader('cookbook', WordListCorpusReader, ['wordlist'])

In [38]:
isinstance(reader, LazyCorpusLoader)

True

In [39]:
reader.fileids()

['wordlist']

In [40]:
"""So in the previous example code, before we call reader.fileids(), reader is an
instance of LazyCorpusLoader, but after the call, reader becomes an instance of
WordListCorpusReader."""
isinstance(reader, LazyCorpusLoader)

False

In [41]:
isinstance(reader, WordListCorpusReader)

True
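The change of type seen in the last few cells can be imitated with a small proxy class. This is a minimal sketch of the general idea, assuming the proxy swaps its own `__dict__` and `__class__` on first attribute access; `LazyLoader` and `WordList` are hypothetical classes written for illustration, not NLTK's implementation.

```python
class WordList:
    """Hypothetical 'expensive' reader standing in for a real corpus reader."""

    def __init__(self, words):
        self.words = list(words)  # imagine slow file loading here

    def fileids(self):
        return ['wordlist']


class LazyLoader:
    """Minimal lazy proxy: builds the real reader on first attribute access."""

    def __init__(self, reader_cls, *args):
        self._reader_cls = reader_cls
        self._args = args

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. before loading.
        reader = self._reader_cls(*self._args)
        # Become the real reader: take over its state and its class.
        self.__dict__ = reader.__dict__
        self.__class__ = reader.__class__
        return getattr(self, name)


reader = LazyLoader(WordList, ['nltk', 'corpus', 'food'])
before = isinstance(reader, LazyLoader)    # still the proxy, nothing loaded yet
ids = reader.fileids()                     # first real access triggers loading
after = isinstance(reader, WordList)       # the proxy has become the reader

print(before, ids, after, isinstance(reader, LazyLoader))
```

Note that isinstance() itself does not trigger loading in this sketch, because `__class__` is found by normal attribute lookup and never reaches `__getattr__`.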