Linguist 278: Programming for Linguists<br />
Stanford Linguistics, Fall 2020<br />
Christopher Potts

# Assignment 8

Distributed 2020-11-09<br />
Due 2020-11-16

Please submit a modified version of this file with the questions completed.

In [1]:
from collections import defaultdict
import glob
import os
import pandas as pd
import string

## SimpleTokenizer [2 points]

For assignment 2, question 1, you wrote a function called `simple_tokenize` for tokenizing text. This question asks you to convert that function to a method on a `SimpleTokenizer` class.  The function should use the same core logic as `simple_tokenize`, but should also honor the optional class parameter `lower` allowing the user to decide whether to downcase all of the tokens. 
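
As a minimal sketch of the two string operations the docstring below relies on (the example strings are hypothetical and chosen only for illustration): `str.strip` with `string.punctuation` removes punctuation from the ends of a token but not from its middle, and `str.split()` with no argument splits on any run of whitespace, whereas `str.split(" ")` splits on each single space.

```python
import string

# Hypothetical token, used only to illustrate peripheral stripping.
token = '"non-issue."'

# Characters are removed from both ends until a non-punctuation
# character is reached; the internal hyphen is left alone.
stripped = token.strip(string.punctuation)
print(stripped)

# Hypothetical text with a double space and a tab.
text = "The  dog\tbarked."

# No argument: split on any run of whitespace.
whitespace_split = text.split()

# Explicit " ": split on each single space, which leaves an empty
# string behind and does not treat the tab as a separator.
single_space_split = text.split(" ")

print(whitespace_split)
print(single_space_split)
```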

In [9]:
class SimpleTokenizer:

    PUNCT = string.punctuation

    def __init__(self, lower=True):
        self.lower = lower

    def tokenize(self, s):
        """Break str `s` into a list of str.

        1. `s` has all of its peripheral whitespace removed.
        2. `s` is downcased if `self.lower` is True, otherwise not.
        3. `s` is split on whitespace.
        4. For each token, any peripheral punctuation on it is stripped
           off. Punctuation is here defined by `string.punctuation`.

        Parameters
        ----------
        s : str
            The string to tokenize.

        Returns
        -------
            list of str
        """
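        # Step 1: remove peripheral whitespace.
        s = s.strip()
        # Step 2: downcase only if requested.
        if self.lower:
            s = s.lower()
        # Steps 3-4: split on any run of whitespace (not just single
        # spaces), then strip peripheral punctuation from each token.
        return [token.strip(self.PUNCT) for token in s.split()]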



In [10]:
def test_simple_tokenizer():
    lower = True
    no_lower = False
    examples = [
        ["The dog barked.", ["the", "dog", "barked"], lower],
        ["The dog barked.", ["The", "dog", "barked"], no_lower],
        ['"Hello?", she said.', ["hello", "she", "said"], lower],
        ["A non-issue.", ["a", "non-issue"], lower]
    ]
    err_count = 0
    for x, expected, lower in examples:
        tokenizer = SimpleTokenizer(lower)
        result = tokenizer.tokenize(x)
        if result != expected:
            print('simple_tokenize error for "{}":\n\tGot: {}\n\tExpected: {}'.format(
                x, result, expected))
            err_count += 1
    print("test_simple_tokenize completed with {} errors".format(err_count))

In [11]:
test_simple_tokenizer()

test_simple_tokenize completed with 0 errors


## Age of Acquisition class [2 points]

The hackathon introduced the [Age-of-acquisition ratings for 30 thousand English words](https://www.humanities.mcmaster.ca/~vickup/Kuperman-BRM-2012.pdf) (Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert, *Behavior Research Methods*, 2012), which has the following columns:

0. `Word`: The word (str)
1. `OccurTotal`: token count in their data
2. `OccurNum`: Participants who gave an age-of-acquisition, rather than saying "Unknown"
3. `Rating.Mean`: mean age of acquisition in years of age
4. `Rating.SD`: standard deviation of the distribution of ages of acquisition
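
To make the column layout concrete, here is a minimal sketch that reads a tiny in-memory table with the same column names, uses `Word` as the index, and averages `Rating.Mean` over a list of words. The rows and numbers are made up for illustration and are not taken from the real dataset; `toy_csv`, `toy_df`, and `toy_mean` are hypothetical names.

```python
import io

import pandas as pd

# Made-up rows in the dataset's column layout. These are NOT real
# ratings; they exist only so the example runs without a download.
toy_csv = io.StringIO(
    "Word,OccurTotal,OccurNum,Rating.Mean,Rating.SD\n"
    "apple,20,20,4.0,1.0\n"
    "banana,20,19,5.0,1.5\n"
    "cherry,20,18,6.0,2.0\n"
)

# `index_col="Word"` makes the words the row labels, so rows can be
# selected by word with `.loc`.
toy_df = pd.read_csv(toy_csv, index_col="Word")

# Select two rows and one column, then average the selected values.
toy_mean = toy_df.loc[["apple", "cherry"], "Rating.Mean"].mean()
print(toy_mean)
```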

Complete the following class definition according to the specifications given by the docstrings and other comments.

In [7]:
class AoA:
    def __init__(self, filename_or_url):
        """Class for working with the Age-of-Acquisition (AoA) dataset.

        Parameters
        ----------
        filename_or_url : str
            Full path to the file on the local machine or
            a URL pointing to the file.

        Attributes
        ----------
        df : pd.DataFrame
            The dataset as read in by pandas, with "Word" as the index.
        word_set : set
            The set of words in the dataset.

        """
        self.filename_or_url = filename_or_url

        # Complete this so that the spreadsheet is read in as a `pd.DataFrame`
        # with the 'Word' column providing the index:
        
        self.df = pd.read_csv(self.filename_or_url, index_col = "Word")

        # Complete this so that it is defined as the set of words in the dataset:
        # the set in the index of `self.df`.
        ## TO BE COMPLETED ##
        self.word_set = set(self.df.index)


    def mean_rating_mean(self, word_or_word_list):
        """Return the mean of the "Rating.Mean" values of the word or
        words given in `word_or_word_list`.

        Parameters
        ----------
        word_or_word_list : str or list of str
            The words to look-up. All the words provided can be assumed to be
            in the dataset.
        """
        return self.df.loc[word_or_word_list]["Rating.Mean"].mean()


    def mean_rating_sd(self, word_or_word_list):
        """Think of something original for this method, and rename it so that
        it matches what it does. Your function can have as many required and
        optional arguments as you wish (including none), and it can do whatever
        you like with the dataset. No need to get carried away; I am assuming
        it will have about the complexity of `mean_rating_mean`.

        This version returns the mean of the "Rating.SD" values of the
        word or words given in `word_or_word_list`, which are assumed
        to be in the dataset.
        """
        return self.df.loc[word_or_word_list]["Rating.SD"].mean()


In [24]:
df = pd.read_csv("http://web.stanford.edu/class/linguist278/data/hackathon/Kuperman-BRM-data-2012.csv")


In [6]:
def test_aoa():
    aoa = AoA("http://web.stanford.edu/class/linguist278/data/hackathon/Kuperman-BRM-data-2012.csv")
    err_count = 0
    if not hasattr(aoa, "df"):
        print("The AoA class should have an attribute `df`.")
        err_count += 1
    elif not isinstance(aoa.df, pd.DataFrame):
        print("The type of the `df` attribute should be `pd.DataFrame`.")
        err_count += 1
    if not hasattr(aoa, "word_set"):
        print("The AoA class should have an attribute `word_set`.")
        err_count += 1
    elif not isinstance(aoa.word_set, set):
        print("The type of the `word_set` attribute should be `set`.")
        err_count += 1
    examples = ['dog', 'canine']
    result = aoa.mean_rating_mean(examples)
    expected = 5.625
    if result != expected:
        print("Error for `mean_rating_mean`: for {}, expected {} but got {}".format(
            examples, expected, result))

In [7]:
test_aoa()

The type of the `df` attribute should be `pd.DataFrame`.
The type of the `word_set` attribute should be `set`.
Error for `mean_rating_mean`: for ['dog', 'canine'], expected 5.625 but got None


## Brown corpus reader [6 points]

This question focuses on building two basic Python classes for processing and working with [the Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus), which is a famous early part-of-speech tagged corpus. Please download the corpus from here:

https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip

The corpus consists of 500 text files. Each file contains a list of sentences. Each sentence is given on its own line, with no linebreaks. (The files also have a lot of blank lines reflecting passage structure that we will ignore.)

The sentences themselves look like this:

> Implementation/nn of/in Georgia's/np$ automobile/nn title/nn law/nn was/bedz also/rb recommended/vbn by/in the/at outgoing/jj jury/nn ./.

The words have part-of-speech (POS) tags on them. The separator is a forward slash, `/`. So for example, `of/in` is the word "of" tagged as a preposition (tag `in`). You don't really need to know what the tags mean for this assignment, but you can find a full glossary of them [here](https://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used).
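
As a small illustration of the `word/tag` format, the sketch below splits a shortened version of the sample sentence into (word, tag) pairs. It splits each token on its rightmost `/` with `str.rsplit`, so that a word containing a slash keeps it. The variable names are hypothetical.

```python
# A shortened version of the sample sentence above.
line = "Georgia's/np$ title/nn law/nn was/bedz recommended/vbn ./."

pairs = []
for token in line.split():
    # `rsplit` with `maxsplit=1` splits only on the rightmost "/",
    # so "./." becomes (".", ".").
    word, tag = token.rsplit("/", maxsplit=1)
    pairs.append((word, tag))

print(pairs)

# A token whose word part contains a slash of its own.
fraction = tuple("2-1/2/cd".rsplit("/", maxsplit=1))
print(fraction)
```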

For this question, you just need to complete the classes for `BrownCorpus` and `BrownSentence` according to the docstrings. 

The intuitive idea is that a `BrownCorpus` instance can read in the corpus files and turn the (non-blank) lines in those files into `BrownSentence` instances, where a `BrownSentence` instance is a list of (word, tag) pairs.
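
The "read lines, skip the blank ones, hand back one item at a time" pattern can be sketched with a generator. This is only an illustration under stated assumptions: `iter_nonblank_lines` and `fake_file` are hypothetical names, and an in-memory `io.StringIO` object stands in for a corpus file.

```python
import io


def iter_nonblank_lines(handle):
    """Yield each non-blank line of `handle`, with surrounding
    whitespace removed."""
    for line in handle:
        line = line.strip()
        # Lines consisting only of whitespace become "" and are skipped.
        if line:
            yield line


# Stand-in for an open corpus file, with blank lines between sentences.
fake_file = io.StringIO(
    "\n\nthe/at cat/nn sat/vbd ./.\n\n\n  it/pps slept/vbd ./.\n"
)

# Nothing is read until the generator is consumed, here by `list`.
lines = list(iter_nonblank_lines(fake_file))
print(lines)
```

Because the function uses `yield`, callers can loop over the result without the whole corpus being held in memory at once.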

Below the class definitions, I've included some `assert` statements that you can use to test your code. Feel free to modify these (e.g., by having them print useful messages). 

In [67]:
d = {'key': 1}


In [71]:
d['key2'] = 2
d['key3'] = {'key3_1' : 0}
d

{'key': 1, 'key2': 2, 'key3': {'key3_1': 0}}

In [143]:
class BrownCorpus:
    def __init__(self, src_dirname="brown"):
        """This init method should create two attributes:

        Attributes
        ----------
        self.src_dirname : str
            The `src_dirname` argument stored as an attribute.

        self.src_filenames : list of str
            The full list of corpus filenames. You can get these by
            using `glob.glob` on `self.src_dirname`. Note: the corpus
            directory contains a few extra metadata files that need
            to be filtered out. Hint: all and only the true corpus
            filenames end in a digit.

        """
        self.src_dirname = src_dirname
        self.src_filenames = glob.glob(self.src_dirname+'/*[0-9]') 


    def iter_sentences(self):
        """This is a method for iterating over the corpus sentences.
        It should loop over all the files in `self.src_filenames`, open
        each one, iterate through its lines, and turn all the non-blank
        lines into `BrownSentence` instances, which it should yield.
        Note: `BrownSentence` instances are initialized with a str
        and the filename that tracks their origin.

        Yields
        ------
        BrownSentence
        """
        for filename in self.src_filenames:
            with open(filename, 'r') as f:
                for line in f:
                    # Skip blank lines, including whitespace-only ones.
                    if line.strip():
                        # Yield rather than build a list, as the
                        # docstring specifies.
                        yield BrownSentence(line, filename)
           
                    
            
        
        
        


    def get_pos_distributions(self):
        """This method returns a two-dimensional count dict mapping
        words to dicts mapping tags to counts. The idea is that this
        makes it easy to see how many different POS tags a word
        appears with, to get its most or least common tag, etc.

        Returns
        -------
        dict mapping str to dicts mapping str to int

        """
        ## TO BE COMPLETED ##
    
        d = {}
        for sentence in self.iter_sentences():
            # Walk the words and tags in parallel.
            for word, tag in zip(sentence.words, sentence.tags):
                # Get (or create) the inner tag-count dict for this word.
                tag_counts = d.setdefault(word, {})
                tag_counts[tag] = tag_counts.get(tag, 0) + 1
        return d

class BrownSentence:
    def __init__(self, raw_string, src_filename):
        """This init method should create five attributes:
    
        Attributes
        ----------
        self.raw_string : str
            Identical to the `raw_string` argument. The presumption
            of this class is that `raw_string` is a non-blank line
            from a Brown corpus file.

        self.src_filename : str
            The filename of the file that contains this sentence.
            Identical to the argument `src_filename`.

        self.lemmas : list of tuple
            Created by the method `get_lemmas` defined below.

        self.words : list of str
            Derived from `self.lemmas`, as the list of the first
            members of those tuples. For instance, if

            self.lemmas = [('the', 'at'), ('cat', 'nn'), ('sat', 'vbd')]

            then

            self.words == ['the', 'cat', 'sat']
            

        self.tags : list of str
            Derived from `self.lemmas`, as the list of the second
            members of those tuples. For instance, if

            self.lemmas = [('the', 'at'), ('cat', 'nn'), ('sat', 'vbd')]

            then

            self.tags == ['at', 'nn', 'vbd']

        """
        ## TO BE COMPLETED ##
        
        self.raw_string = raw_string
        self.src_filename = src_filename
        self.lemmas = self.get_lemmas()
        
        self.tags = []
        self.words = []
        for lemma in self.lemmas:
            self.words.append(lemma[0])
            self.tags.append(lemma[1])
        
            
        

    def get_lemmas(self):
        """This method creates a list of lemmas from `self.raw_string`.
        A lemma is a pair consisting of a word token and a
        part-of-speech (POS) tag.  In the Brown corpus, the lemmas of
        a sentence are separated by whitespace, and the word and POS
        of each lemma are separated by a /. If a lemma contains multiple
        / characters, it will be the rightmost one that separates the
        word from the POS tag. For example,

        "2-1/2/cd"

        should be parsed as ('2-1/2', 'cd'). Check out the str method
        `rsplit` for help with this.

        Returns
        -------
        list of tuple

        """
        ## TO BE COMPLETED ##
        lemmas = []
        # `split()` with no argument splits on any run of whitespace
        # and never produces empty strings.
        for token in self.raw_string.split():
            # Split on the rightmost "/" only.
            lemmas.append(tuple(token.rsplit("/", maxsplit=1)))
        return lemmas
        
    
            
    


    def __len__(self):
        """Defined as len(self.lemmas)"""
        return len(self.lemmas) 

In [144]:
sent = BrownSentence("  the/at  cat/nn ate/vbd  and/or/cc  slept/vbd ./.", "foo.txt")

In [145]:
sent.lemmas

[('the', 'at'),
 ('cat', 'nn'),
 ('ate', 'vbd'),
 ('and/or', 'cc'),
 ('slept', 'vbd'),
 ('.', '.')]

In [146]:
sent.words

['the', 'cat', 'ate', 'and/or', 'slept', '.']

In [147]:
sent.tags

['at', 'nn', 'vbd', 'cc', 'vbd', '.']

In [148]:
assert sent.raw_string.strip() == "the/at  cat/nn ate/vbd  and/or/cc  slept/vbd ./."

In [149]:
assert sent.src_filename == "foo.txt"

In [150]:
assert sent.lemmas == [('the', 'at'), ('cat', 'nn'), ('ate', 'vbd'),
                       ('and/or', 'cc'), ('slept', 'vbd'), ('.', '.')]

In [151]:
assert sent.words == ['the', 'cat', 'ate', 'and/or', 'slept', '.']

In [152]:
assert sent.tags == ['at', 'nn', 'vbd', 'cc', 'vbd', '.']

In [153]:
assert len(sent) == 6

In [154]:
# This code assumes the corpus is in a directory called "brown"
# in the same directory as this notebook. Feel free to put it
# somewhere else if you prefer.

corpus = BrownCorpus(src_dirname="brown")

In [155]:
assert len(corpus.src_filenames) == 500

In [156]:
dist = corpus.get_pos_distributions()

In [157]:
assert dist['commented'] == {'vbd': 16, 'vbn': 2}

In [158]:
assert dist['!'] == {'.': 1590, '.-hl': 6}
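
The docstring for `get_pos_distributions` mentions using the nested dict to find a word's most common tag or to see how many tags it appears with. A minimal sketch of both follows, assuming a dict of the same shape; `most_common_tag`, `ambiguous_words`, and `toy_dist` are hypothetical names, and the two entries in `toy_dist` are the counts asserted above.

```python
def most_common_tag(dist, word):
    """Return the tag that `word` occurs with most often in `dist`."""
    tag_counts = dist[word]
    # `max` over the tags, ranked by their counts.
    return max(tag_counts, key=tag_counts.get)


def ambiguous_words(dist):
    """Return the sorted list of words that occur with more than one tag."""
    return sorted(word for word, tag_counts in dist.items()
                  if len(tag_counts) > 1)


# A two-word stand-in for the full distribution, using the counts
# from the assert statements above.
toy_dist = {
    'commented': {'vbd': 16, 'vbn': 2},
    '!': {'.': 1590, '.-hl': 6},
}

print(most_common_tag(toy_dist, 'commented'))
print(ambiguous_words(toy_dist))
```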