# COLX 521 Lecture 8: Corpus Readers

* NLTK corpus readers (a closer look)
* Classes revisited
* Generator functions
* Building corpus readers

## NLTK corpus readers (a closer look)

We've been accessing corpora in NLTK using something that looks like a method of some module or object which is specific to each corpus. For example:

In [3]:
#provided code
from nltk.corpus import brown

brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

Does NLTK really have a totally separate corpus reader class for each of its corpora? Actually, no. If you look at the code for the `nltk.corpus` module, you find the definitions of the corpus readers we have been using, and several corpora share the same reader class.

In [11]:
#provided code
# from \Lib\site-packages\nltk\corpus\__init__.py

from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import *


brown = LazyCorpusLoader(
    'brown',
    CategorizedTaggedCorpusReader,
    r'c[a-z]\d\d',
    cat_file='cats.txt',
    tagset='brown',
    encoding="ascii",
)

gutenberg = LazyCorpusLoader(
    'gutenberg', PlaintextCorpusReader, r'(?!\.).*\.txt', encoding='latin1'
)

movie_reviews = LazyCorpusLoader(
    'movie_reviews',
    CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt',
    cat_pattern=r'(neg|pos)/.*',
    encoding='ascii',
)

switchboard = LazyCorpusLoader('switchboard', SwitchboardCorpusReader, tagset='wsj')

reuters = LazyCorpusLoader(
    'reuters',
    CategorizedPlaintextCorpusReader,
    '(training|test).*',
    cat_file='cats.txt',
    encoding='ISO-8859-2',
)


treebank = LazyCorpusLoader(
    'treebank/combined',
    BracketParseCorpusReader,
    r'wsj_.*\.mrg',
    tagset='wsj',
    encoding='ascii',
)



They are wrapped in a "LazyCorpusLoader" object, but that's just a trick to stop the corpus from being loaded before it is used; the objects actually being accessed when we run `words()` etc. are the various corpus readers.
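To make the lazy-loading idea concrete, here is a minimal sketch of such a wrapper. It is not NLTK's implementation: `LazyLoader` and `ToyReader` are hypothetical names invented for illustration, and the only point is that the wrapped reader is built the first time one of its attributes is requested.

```python
class LazyLoader:
    """Hypothetical lazy wrapper (an illustration, not NLTK's LazyCorpusLoader)."""

    def __init__(self, name, reader_cls, *args, **kwargs):
        self._name = name
        self._reader_cls = reader_cls
        self._args = args
        self._kwargs = kwargs
        self._reader = None  # nothing has been built yet

    def _load(self):
        # Build the real reader only once, on first use
        if self._reader is None:
            self._reader = self._reader_cls(*self._args, **self._kwargs)
        return self._reader

    def __getattr__(self, attr):
        # Called only when normal attribute lookup fails,
        # i.e. for methods that live on the wrapped reader
        if attr.startswith("_"):
            raise AttributeError(attr)
        return getattr(self._load(), attr)

    def __repr__(self):
        state = "loaded" if self._reader is not None else "not loaded yet"
        return f"<{self._reader_cls.__name__} '{self._name}' ({state})>"


class ToyReader:
    """Hypothetical reader that 'loads' a fixed string."""

    def __init__(self, text):
        print("building the reader...")
        self._words = text.split()

    def words(self):
        return self._words


toy = LazyLoader("toy", ToyReader, "the quick brown fox")
print(toy)          # the reader has not been built at this point
print(toy.words())  # the first method call triggers the build
print(toy)
```

The design choice here is that importing a module full of such wrappers stays cheap, because no corpus files are touched until a reader method is actually called.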

In [6]:
print(brown)

<CategorizedTaggedCorpusReader in '.../corpora/brown' (not loaded yet)>


In [7]:
print(gutenberg)

<PlaintextCorpusReader in '.../corpora/gutenberg' (not loaded yet)>


In [8]:
print(treebank)

<BracketParseCorpusReader in '.../corpora/treebank/combined' (not loaded yet)>


Some corpora require special readers (e.g. Switchboard), but there are a few basic kinds that we can use for our own corpora. The simplest is the PlaintextCorpusReader. Let's create some plain text and load it using an NLTK corpus reader.

In [14]:
text = "Dr. Brooke got his Ph.D. from the Univ. of Toronto on 12/2013. Dr. Brooke's GPA was 4.1, and his thesis wasn't half-bad, if a bit too long. He went into industry...but later came back to academia."
fout = open("test.txt","w",encoding="ascii")
fout.write(text)
fout.close()

#my code here
reader = PlaintextCorpusReader("./", "test.txt",encoding="latin-1")
for sent in reader.sents():
    print(sent)
#my code here

['Dr', '.', 'Brooke', 'got', 'his', 'Ph', '.', 'D', '.', 'from', 'the', 'Univ', '.']
['of', 'Toronto', 'on', '12', '/', '2013', '.']
['Dr', '.', 'Brooke', "'", 's', 'GPA', 'was', '4', '.', '1', ',', 'and', 'his', 'thesis', 'wasn', "'", 't', 'half', '-', 'bad', ',', 'if', 'a', 'bit', 'too', 'long', '.']
['He', 'went', 'into', 'industry', '...', 'but', 'later', 'came', 'back', 'to', 'academia', '.']


Let's look at one other NLTK corpus reader, TaggedCorpusReader, which reads a format that should be somewhat familiar by now...

In [17]:
text1 = "The/AT Fulton/NP County/NN grand/JJ jury/NN said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./"
text2 = "Lebanon/NP1 leader/NN1 builds/VVZ cabinet/NN1 ./YSTP\n"
fout = open("test1.txt","w",encoding="ascii")
fout.write(text1)
fout.close()
fout = open("test2.txt","w",encoding="ascii")
fout.write(text2)
fout.close()

#my code here
reader = TaggedCorpusReader("./", r"test\d\.txt",encoding="ascii")
for sent in reader.tagged_sents():
    print(sent)
#my code here

[('the', 'AT'), ('fulton', 'NP'), ('county', 'NN'), ('grand', 'JJ'), ('jury', 'NN'), ('said', 'VBD'), ('friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '')]
[('Lebanon', 'NP1'), ('leader', 'NN1'), ('builds', 'VVZ'), ('cabinet', 'NN1'), ('.', 'YSTP')]
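The `word/TAG` format is simple enough to parse by hand. The sketch below is a rough approximation of what a tagged reader has to do for each token, not NLTK's own code; `split_tagged_token` and `parse_tagged_sentence` are hypothetical helper names. It splits at the last separator so that a word containing a slash keeps its slash.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/TAG' at the LAST separator (hypothetical helper)."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator at all: treat the token as untagged
        return (token, None)
    return (word, tag)


def parse_tagged_sentence(line):
    """Turn one whitespace-separated line into a list of (word, tag) pairs."""
    return [split_tagged_token(tok) for tok in line.split()]


print(parse_tagged_sentence("Lebanon/NP1 leader/NN1 builds/VVZ cabinet/NN1 ./YSTP"))
print(split_tagged_token("./"))  # a token with a separator but an empty tag
```

Note that a token like `./`, which ends the first test file above, yields an empty tag string rather than a missing tag.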


Note that these corpus reader classes are not independent of one another. They are subclasses of a general CorpusReader class which contains some methods and variables which are common to all NLTK corpus readers. You can check to see if one class is a subclass of another by using [issubclass](https://docs.python.org/3/library/functions.html#issubclass), and you can see the methods and variables of a class by using the [dir](https://docs.python.org/3/library/functions.html#dir) function.

In [19]:
issubclass(PlaintextCorpusReader,CorpusReader)

True

In [20]:
dir(CorpusReader)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_get_root',
 'abspath',
 'abspaths',
 'citation',
 'encoding',
 'ensure_loaded',
 'fileids',
 'license',
 'open',
 'readme',
 'root',
 'unicode_repr']

In [21]:
dir(TaggedCorpusReader)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_get_root',
 'abspath',
 'abspaths',
 'citation',
 'encoding',
 'ensure_loaded',
 'fileids',
 'license',
 'open',
 'paras',
 'raw',
 'readme',
 'root',
 'sents',
 'tagged_paras',
 'tagged_sents',
 'tagged_words',
 'unicode_repr',
 'words']
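Comparing two long `dir` listings by eye is tedious. A set difference shows what a subclass adds on top of its parent. The sketch below uses two hypothetical toy classes (`BaseReader` and `WordReader`) so that it runs without NLTK; the same expression could be applied to `TaggedCorpusReader` and `CorpusReader`, where, judging from the listings above, the difference would be the `paras`, `raw`, `sents`, `words` and `tagged_*` methods.

```python
class BaseReader:
    """Hypothetical base class holding shared functionality."""

    def fileids(self):
        return []


class WordReader(BaseReader):
    """Hypothetical subclass adding word- and sentence-level access."""

    def words(self):
        return []

    def sents(self):
        return []


# Names that exist on the subclass but not on the base class
added = sorted(set(dir(WordReader)) - set(dir(BaseReader)))

print(issubclass(WordReader, BaseReader))  # True
print(added)                               # ['sents', 'words']
```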

## Classes revisited

For the rest of this lecture we will consider how to build our own corpus reader for corpora that require special processing. There are a few pieces that we need to put together to get the kind of functionality we see in NLTK. The first is Python classes, which you've of course already built in DSCI 511. Let's review this by creating a SimpleFileIO class which will allow a user of the class to write to and read from documents in a particular directory (root) with a particular encoding. Let's start with just the `__init__` method.

In [23]:
class SimpleFileIO:
    
    def __init__(self, root, encoding):
        self.root = root
        self.encoding = encoding

The initialization for the SimpleFileIO class we just wrote relies on a specific, somewhat arbitrary ordering of arguments. It can be easier to use functions/methods which take keyword arguments, allowing for arbitrary ordering. Another useful aspect of keyword arguments is the ability to have defaults, allowing you to leave out arguments entirely.


In [25]:
class SimpleFileIO:
    
    def __init__(self, root=".", encoding="utf-8"):
        self.root = root
        self.encoding = encoding      

In [30]:
#provided code
test1 = SimpleFileIO()
assert test1.root == "."
test2 = SimpleFileIO(encoding="latin-1")
assert test2.encoding == "latin-1"
print("Success!")

Success!


Now, let's add some methods which allow us to write a string to a given file (in our "root" directory). One way our SimpleFileIO is going to be different from plain file writing is that instead of just overwriting a file if it already exists, our code will inform the user that the file exists unless an optional boolean `overwrite` is True. We will have a special internal method which checks if the file already exists, which we will also use for reading.

In [1]:
import os

class SimpleFileIO:
    '''easy, safe reading and writing to files in a given directory'''
     
    def __init__(self, root="./", encoding="utf-8"):
        '''
        root -- the directory where the files will reside. Create this directory if it doesn't exist
        encoding -- the encoding used to open the files
        '''
        self.root = root if root.endswith("/") else root + "/"
        self.encoding = encoding
        if not os.path.exists(self.root):
            os.mkdir(self.root)
    
    def _file_exists(self, filename):
        '''see if filename exists in directory given by self.root'''
        return os.path.exists(self.root + filename)
    
    def write(self,filename,string,overwrite=False):
        ''' write a string to a file, if the file doesn't exist or overwrite is True
        
        filename -- the filename to write to
        string -- the contents of the file
        overwrite -- True if the file should be written even if it already exists'''
        if not overwrite and self._file_exists(filename):
            print(filename + " already exists!")
        else:
            fout = open(self.root + filename,"w",encoding=self.encoding)
            fout.write(string)
            fout.close()
            
    def read(self,filename):
        '''read a string from file indicated by filename, will print out a polite warning if file isn't there'''
        if not self._file_exists(filename):
            print("That file doesn't exist, sorry!")
        else:
            f = open(self.root + filename, encoding=self.encoding)
            text = f.read()
            f.close()
            return text    
            

In [3]:
#provided code
test = SimpleFileIO(root="test_dir")
test.write("test.txt","This is a test",overwrite=True)
test.write("test.txt","This shouldn't work")

test.txt already exists!


Let's add some docstrings and access them via [help](https://docs.python.org/3/library/functions.html#help):

In [4]:
help(test)

Help on SimpleFileIO in module __main__ object:

class SimpleFileIO(builtins.object)
 |  easy, safe reading and writing to files in a given directory
 |  
 |  Methods defined here:
 |  
 |  __init__(self, root='./', encoding='utf-8')
 |      root -- the directory where the files will reside. Create this directory if it doesn't exist
 |      encoding -- the encoding used to open the files
 |  
 |  read(self, filename)
 |  
 |  write(self, filename, string, overwrite=False)
 |      write a string to a file, if the file doesn't exist or overwrite is True
 |      
 |      filename -- the filename to write to
 |      string -- the contents of the file
 |      overwrite -- True if the file should be written even if it already exists
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

Exercise: Add a `read` method to the class which is analogous to `write`. It should print out a warning rather than throwing an exception if the file does not exist (it should use `_file_exists`). Write some tests for your method. Then add a docstring.

In [55]:
test.write("test.txt","This is a test",overwrite=True)
print(test.read("test.txt"))
test.read("test2.txt")

This is a test
That file doesn't exist, sorry!
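As a possible variation on the class above, the sketch below uses `with` blocks (so files are closed even if an error occurs) and `os.path.join` (so the root does not need a trailing slash). `SafeFileIO` is a hypothetical name, and returning `True`/`False` from `write` is an added assumption that makes the behaviour easier to test. The demo writes into a temporary directory so nothing is left behind.

```python
import os
import tempfile


class SafeFileIO:
    """Hypothetical variant of SimpleFileIO using `with` and os.path.join."""

    def __init__(self, root=".", encoding="utf-8"):
        self.root = root
        self.encoding = encoding
        os.makedirs(self.root, exist_ok=True)  # no error if it already exists

    def _path(self, filename):
        return os.path.join(self.root, filename)

    def write(self, filename, string, overwrite=False):
        """Return True if the file was written, False if it was left alone."""
        path = self._path(filename)
        if os.path.exists(path) and not overwrite:
            print(filename + " already exists!")
            return False
        with open(path, "w", encoding=self.encoding) as fout:
            fout.write(string)
        return True

    def read(self, filename):
        """Return the file contents, or None (with a warning) if it is missing."""
        path = self._path(filename)
        if not os.path.exists(path):
            print("That file doesn't exist, sorry!")
            return None
        with open(path, encoding=self.encoding) as f:
            return f.read()


with tempfile.TemporaryDirectory() as tmp:
    file_io = SafeFileIO(root=os.path.join(tmp, "test_dir"))
    file_io.write("test.txt", "This is a test")
    file_io.write("test.txt", "This shouldn't work")  # prints a warning
    print(file_io.read("test.txt"))
```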


## Generator functions

When we use a corpus reader, we are able to iterate through the words or sentences of an entire corpus just as if the corpus reader had provided us with a list. But the result of `words()` or `sents()` is not actually a list; it is a lazy object that, like a generator, produces its items on demand. The main advantage of a generator is that it doesn't have to store everything it intends to return in memory all at once, which is very important when dealing with large corpora! Generators are used almost exclusively as the iterated objects in for loops. A built-in example of this kind of lazy behaviour in Python 3 is *range* (strictly speaking a lazy sequence rather than a generator, but it likewise never builds the full list in memory; note that in Python 2 *range* did return a list!).

In [5]:
#provided code
a_big_range = range(10000000000000000000000000000000000000000000000000000)
for i in a_big_range:
    print(i)
    if i == 10:
        break

0
1
2
3
4
5
6
7
8
9
10
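To get a rough feel for the memory difference, the sketch below compares a list with a generator expression over the same values. `sys.getsizeof` only measures the container object itself (not the integers it refers to), so treat the numbers as an illustration and not as a careful benchmark; the exact sizes vary by Python version.

```python
import sys

n = 100_000

as_list = [i * i for i in range(n)]  # all n results are stored at once
as_gen = (i * i for i in range(n))   # results are produced one at a time

print(sys.getsizeof(as_list))  # grows with n
print(sys.getsizeof(as_gen))   # small and independent of n

# Both can be consumed the same way by a loop or by sum()
print(sum(as_gen) == sum(as_list))
```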


The trick behind a generator is a function which has a *yield* statement rather than a *return* statement. A *return* happens only once in any call to a function, but a *yield* is typically executed multiple times. Each time a yield statement is executed, the yielded object is passed back to the caller, and the function pauses (remembering its current state) until the caller asks for another object, at which point it picks up where it left off. In nearly all circumstances, a yield statement will be inside a loop. As a toy example of this, let's create a my_range function which does the same thing as range (for a single argument), using a while loop.

In [4]:
def my_range(n):
    i = 0
    #my code here
    while i < n:
        yield i
        i+= 1
    #my code here
        
for i in my_range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9
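The pausing and resuming is easier to see when the generator is driven by hand with the built-in `next` instead of a for loop. In this sketch, `countdown` is a hypothetical generator with print statements added purely to show when its body actually runs.

```python
def countdown(n):
    """Hypothetical generator that reports when its body is running."""
    print("body started")
    while n > 0:
        yield n
        print("resumed after yielding", n)
        n -= 1
    print("body finished")


gen = countdown(2)   # nothing is printed here: the body has not started
first = next(gen)    # runs the body up to the first yield
second = next(gen)   # resumes after the first yield, stops at the second
print(first, second)

try:
    next(gen)        # the loop ends, so there is nothing left to yield
except StopIteration:
    print("generator is exhausted")
```

A for loop does the same thing behind the scenes: it calls `next` repeatedly and stops quietly when `StopIteration` is raised.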


A slightly more useful example: let's write a generator that offers up the first n Fibonacci numbers. Each number in the Fibonacci sequence is calculated by summing the previous two numbers in the sequence, starting with a "base case" of 0 and 1 (it's typically solved with recursion)...

In [9]:
def fibonacci(n):
    prev = 1
    prev_prev = 0
    count = 0
    #my code here
    while count < n:
        # yield the current number, then advance the pair by one step
        yield prev_prev
        prev_prev, prev = prev, prev_prev + prev
        count += 1
    #my code here

for fib_num in fibonacci(20):
    print(fib_num)


0
1
1
2
3
5
8
13
21
34
55
89
144
233
377
610
987
1597
2584
4181


Exercise: Build a generator which offers up a range from zero to a supplied n, except that it skips numbers which have the digit 7 in them or are a multiple of 7.

In [8]:
def non_seven_nums(n):
    # your code here
    i = 0
    while i < n:
        if "7" not in str(i) and not i % 7 == 0:
            yield i
        i+= 1
    # your code here
            
for non_seven_num in non_seven_nums(100):
    print(non_seven_num)      

1
2
3
4
5
6
8
9
10
11
12
13
15
16
18
19
20
22
23
24
25
26
29
30
31
32
33
34
36
38
39
40
41
43
44
45
46
48
50
51
52
53
54
55
58
59
60
61
62
64
65
66
68
69
80
81
82
83
85
86
88
89
90
92
93
94
95
96
99


It's fairly common to have generators inside generators. Both generators have yields, and the outer generator iterates over the inner one.

In [5]:
def squares(n):
    # my code here
    for num in my_range(n):
        yield num**2
    # my code here


for num in squares(10):
    print(num)

0
1
4
9
16
25
36
49
64
81
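When the outer generator simply passes along every item of an inner iterable, Python's `yield from` statement is a shorthand for the inner for loop. The sketch below uses two hypothetical generators, `sentences` and `words`, to flatten tokenized sentences into a stream of words, which is the shape a corpus reader often needs.

```python
def sentences():
    """Hypothetical inner generator: one tokenized sentence at a time."""
    yield ["He", "went", "into", "industry"]
    yield ["He", "came", "back"]


def words():
    """Outer generator: flattens the sentences into a stream of words."""
    for sent in sentences():
        # Equivalent to: for word in sent: yield word
        yield from sent


print(list(words()))
# ['He', 'went', 'into', 'industry', 'He', 'came', 'back']
```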


One more key difference between generators and regular functions: when you call a regular function (using parentheses) in any context, it is immediately evaluated and the result returned. This is not true for generator functions. Calling one just creates a generator object without running any of its code, so the generator can be created outside a for loop, assigned to a variable, and then used later in a for loop.

In [82]:
a = sum([1,2,3,4])
a

10

In [6]:
a = my_range(5)
a

<generator object my_range at 0x000002530B6C90A0>

In [7]:
for num in a:
    print(num)

0
1
2
3
4
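A related gotcha is that a generator object can only be consumed once. The sketch below repeats the `my_range` definition so that it runs on its own, and shows that a second pass over the same generator object yields nothing; a fresh call is needed to start again.

```python
def my_range(n):
    i = 0
    while i < n:
        yield i
        i += 1


a = my_range(3)
first_pass = list(a)   # consumes the generator
second_pass = list(a)  # the same object is now exhausted

print(first_pass)      # [0, 1, 2]
print(second_pass)     # []

# Calling the generator function again creates a fresh generator
print(list(my_range(3)))  # [0, 1, 2]
```

This is one reason corpus readers expose methods such as `words()` instead of a stored generator attribute: each call can hand back a fresh iteration over the corpus.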


## Building corpus readers

A good corpus reader uses classes and generator functions to seamlessly offer up the contents of a large corpus in a consistent pre-processed format to some external user without requiring a huge memory footprint. Normally the corpus involved would be text files on disk, but we could even have a corpus reader that pulls an entire corpus from the web as part of its reading process. Let's do exactly that with some provided pages from the Master of Data Science website. For now, our corpus reader will just offer up words.

In [64]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from nltk import word_tokenize
pages = ["https://masterdatascience.ubc.ca/",
         "https://masterdatascience.ubc.ca/programs/computational-linguistics",
         "https://masterdatascience.ubc.ca/admissions/frequently-asked-questions",
         "https://masterdatascience.ubc.ca/why-ubc/our-campuses"]

class MDS_corpus_reader:
    
    def words(self):
        # my code here
        for page in pages:
            soup = BeautifulSoup(urlopen(page), "lxml")
            words = word_tokenize(soup.get_text())
            for word in words:
                yield word

        #my code here

Now let's use this corpus reader to build a dictionary of counts.

In [65]:
from collections import Counter

reader = MDS_corpus_reader()
counts = Counter(reader.words())
print(counts.most_common(50))

[("''", 698), (',', 604), ('.', 228), (':', 213), ('and', 203), ('of', 164), ('the', 163), ('to', 135), (':1', 128), ('(', 115), (')', 115), ('Data', 77), ('a', 77), ('UBC', 71), ('for', 70), ('>', 62), ('s', 61), ('in', 61), ('!', 60), ('Science', 57), ('<', 57), ('?', 56), ('--', 52), ('{', 51), ('}', 51), (';', 50), ('[', 49), (']', 49), ('program', 49), ('``', 48), ('MDS', 46), ('The', 45), ('is', 44), ('with', 44), ('data', 40), ('//', 39), ("'", 37), ('Vancouver', 37), ('Master', 33), ('Computational', 33), ('Okanagan', 32), ('are', 32), ('on', 32), ('students', 31), (':0', 30), ('campus', 30), ('Linguistics', 29), ('at', 29), ('Students', 26), ('’', 26)]
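For the more usual case of text files on disk, the same class-plus-generator pattern applies. The sketch below is a simplified illustration using only the standard library: `DirectoryCorpusReader` is a hypothetical class, whitespace splitting stands in for a real tokenizer, and the demo files are created in a temporary directory.

```python
import os
import tempfile


class DirectoryCorpusReader:
    """Hypothetical reader: streams words from the .txt files in a directory."""

    def __init__(self, root, encoding="utf-8"):
        self.root = root
        self.encoding = encoding

    def fileids(self):
        return sorted(f for f in os.listdir(self.root) if f.endswith(".txt"))

    def lines(self):
        for fileid in self.fileids():
            path = os.path.join(self.root, fileid)
            with open(path, encoding=self.encoding) as f:
                for line in f:  # only one line is held in memory at a time
                    yield line

    def words(self):
        for line in self.lines():
            yield from line.split()  # stand-in for a real tokenizer


with tempfile.TemporaryDirectory() as tmp:
    demo_files = [("a.txt", "the cat sat\non the mat\n"), ("b.txt", "the end\n")]
    for name, text in demo_files:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fout:
            fout.write(text)
    reader = DirectoryCorpusReader(tmp)
    demo_words = list(reader.words())

print(demo_words)
# ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'end']
```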


Exercise: modify the corpus reader so that it has an option to remove stopwords and nonalphabetic words, and then test it out by building another dictionary of counts; does it look more useful?

In [78]:
from nltk import sent_tokenize
from nltk.corpus import stopwords

stopwords_set = set(stopwords.words("english"))

class MDS_corpus_reader(object):
    
    def words(self,remove_stopwords=False,remove_non_alpha=False):
        for page in pages:
            soup = BeautifulSoup(urlopen(page),"lxml")
            # your code here
            words = word_tokenize(soup.get_text())
            for word in words:
                if (not remove_stopwords or word not in stopwords_set) and (not remove_non_alpha or word.isalpha()):
                    yield word
            # your code here


In [80]:
from collections import Counter

reader = MDS_corpus_reader()
counts = Counter(reader.words(remove_stopwords=True,remove_non_alpha=True))
print(counts.most_common(50))

[('Data', 77), ('UBC', 71), ('Science', 57), ('program', 49), ('MDS', 46), ('The', 45), ('data', 40), ('Vancouver', 37), ('Master', 33), ('Computational', 33), ('Okanagan', 32), ('students', 31), ('campus', 30), ('Linguistics', 29), ('Students', 26), ('courses', 26), ('Instructor', 26), ('Alumni', 22), ('Why', 20), ('language', 19), ('I', 18), ('Stories', 17), ('Subscribe', 17), ('l', 16), ('University', 15), ('British', 15), ('Columbia', 15), ('In', 14), ('Questions', 14), ('Career', 14), ('Us', 14), ('analysis', 14), ('learning', 14), ('experience', 14), ('CDATA', 13), ('Frequently', 13), ('Asked', 13), ('Our', 13), ('application', 13), ('w', 12), ('var', 12), ('link', 12), ('Search', 12), ('Student', 12), ('Campuses', 12), ('How', 12), ('What', 12), ('false', 11), ('take', 11), ('Tuition', 11)]
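Capitalized forms such as 'The', 'In' and 'What' survive in the counts above. One likely explanation is that the NLTK English stopword list is lowercase while the membership test in the solution is case-sensitive. The sketch below shows a case-insensitive version of the filter as a standalone generator; `filter_words` and the toy stopword set are hypothetical and stand in for the reader method and the NLTK list.

```python
def filter_words(words, stopwords, remove_stopwords=False, remove_non_alpha=False):
    """Hypothetical filter: yields only the words that pass the chosen checks."""
    for word in words:
        # Lowercase before the lookup so that 'The' matches 'the'
        if remove_stopwords and word.lower() in stopwords:
            continue
        if remove_non_alpha and not word.isalpha():
            continue
        yield word


toy_stopwords = {"the", "in", "what", "is"}  # stand-in for the NLTK list
tokens = ["The", "program", "is", "in", "Vancouver", ",", "What", "'s", "new", "?"]

kept = list(filter_words(tokens, toy_stopwords,
                         remove_stopwords=True, remove_non_alpha=True))
print(kept)  # ['program', 'Vancouver', 'new']
```

Lowercasing only for the lookup keeps the original capitalization in the output, so 'Data' and 'data' are still counted separately unless the words are also normalized before counting.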
