Introduction to NLP course (2017-2018).

Homework 1: Tokenization and corpus statistics

Objectives:

1) Load and tokenize the treebank corpus from NLTK using regexp_tokenize()
- obtain the corpus using get_corpus_t1()
- obtain the gold standard using get_gold_tokens()
- extend the existing regexp grammar to improve its coverage
- modify the corpus prior to the tokenization (if needed)
- tokenize the corpus with regexp_tokenize()
- evaluate the tokenization using evaluate_t1()
- improve the regexp grammar until satisfied with the result
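
The grammar-extension step can be tried out on a toy sentence first. The sketch below is illustrative only: it uses `re.findall` from the standard library as a stand-in for `regexp_tokenize()` (assuming that the tokenizer likewise returns the successive non-overlapping matches of the pattern), and the names `toy_grammar` and `sample` are made up for this example. The lookahead lets a stem match only when `n't` follows, so negated contractions split the way the gold standard expects ("is n't") while ordinary words stay whole.

```python
import re

# Hypothetical toy grammar, not the grammar used in this homework.
toy_grammar = r"""(?x)      # verbose mode: whitespace and comments are ignored
      \w+(?=n't)            # a stem, only when it is directly followed by n't
    | n't                   # the negation clitic as its own token
    | \w+                   # any other word
    | [.,!?]                # a few punctuation marks as separate tokens
"""

sample = "It isn't in the area."
toy_tokens = re.findall(toy_grammar, sample)
print(toy_tokens)  # ['It', 'is', "n't", 'in', 'the', 'area', '.']
```

Because alternatives are tried from left to right, the more specific patterns have to come before the generic `\w+` alternative.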

2) Print basic statistics for the corpus (after the tokenization)
- The number of tokens in the corpus
- The number of types in the corpus (case insensitive!)
- The number of hapaxes - types that appear in the corpus only once (case insensitive!)
- The most frequent types with length >= 5
- The average token length
- The most frequent token length in the corpus
- The number of bi-, tri-, and five-grams in the corpus (you need to write your own function for extracting five-grams)
- The most frequent bi- and tri-grams that do NOT contain punctuation (for the task, assume punctuation to be , . ! ? )
- The most frequent five-grams
- The percentage of bi-, tri-, and five-grams that appear only once
- The 10 most frequent collocates of "man" and "woman" in the corpus, within a window of 4
- The 10 most frequent collocates of "man" and "woman", with a frequency of 5 or more, according to the PPMI score (within a window of 4)
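
The n-gram items above can be prototyped without NLTK. The following is a minimal sketch, not the required solution: `extract_ngrams` and the toy sentence are hypothetical names and data chosen for illustration. It builds n-grams by zipping n shifted views of the token list and then counts how many n-grams occur exactly once.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return the list of n-grams (tuples of n consecutive tokens)."""
    # zip stops at the shortest shifted view, so a list of N tokens
    # yields N - n + 1 n-grams (and none at all if N < n).
    return list(zip(*[tokens[i:] for i in range(n)]))

toy_tokens = "the cat sat on the mat and the cat sat on the rug".split()
fivegrams = extract_ngrams(toy_tokens, 5)
fivegram_counts = Counter(fivegrams)

# n-grams seen exactly once, as a share of all n-gram occurrences
once = sum(1 for count in fivegram_counts.values() if count == 1)
share_once = 100.0 * once / len(fivegrams)
print("%d five-grams, %.2f%% appear only once" % (len(fivegrams), share_once))
```

The percentage here is taken over all n-gram occurrences. Dividing by the number of distinct n-grams instead is an equally defensible reading of the task and gives a different figure.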


In [204]:
# Import section

# Import nltk
import nltk
from nltk import word_tokenize
from nltk import regexp_tokenize
from nltk import FreqDist
from nltk import bigrams, trigrams
from nltk.collocations import BigramCollocationFinder
import numpy as np
from nltk.corpus import stopwords

# Import regular expressions
import re

# Import corpora
from nltk.corpus import treebank_raw

In [3]:
## Functions given in the task
## You should not change anything here
def get_corpus_t1(nr_files=199):
    """Returns the raw corpus as a long string.
    'nr_files' says how much of the corpus is returned;
    default is 199, which is the whole corpus.
    """
    fileids = nltk.corpus.treebank_raw.fileids()[:nr_files]
    corpus_text = nltk.corpus.treebank_raw.raw(fileids)
    # Get rid of the ".START" text in the beginning of each file:
    corpus_text = corpus_text.replace(".START", "")
    return corpus_text

def fix_gold_tokens(tokens):
    """Replace tokens so that they are similar to the raw corpus text."""
    return [token.replace("''", '"').replace("``",'"').replace(r"\/", "/") for token in tokens]

def get_gold_tokens(nr_files=199):
    """Returns the gold corpus as a list of strings.
    'nr_files' says how much of the corpus is returned;
    default is 199, which is the whole corpus.
    """
    fileids = nltk.corpus.treebank_chunk.fileids()[:nr_files]
    gold_tokens = nltk.corpus.treebank_chunk.words(fileids)
    return fix_gold_tokens(gold_tokens)

def evaluate_t1(test_tokens, gold_tokens):
    """Finds the chunks where test_tokens differs from gold_tokens.
    Prints the errors and calculates similarity measures.
    """
    import difflib
    matcher = difflib.SequenceMatcher()
    matcher.set_seqs(test_tokens, gold_tokens)
    error_chunks = true_positives = false_positives = false_negatives = 0
    print(" Token%30s | %-30sToken" % ("Error", "Correct"))
    print("-" * 38 + "+" + "-" * 38)
    for difftype, test_from, test_to, gold_from, gold_to in matcher.get_opcodes():
        if difftype == "equal":
            true_positives += test_to - test_from
        else:
            false_positives += test_to - test_from
            false_negatives += gold_to - gold_from
            error_chunks += 1
            test_chunk = " ".join(test_tokens[test_from:test_to])
            gold_chunk = " ".join(gold_tokens[gold_from:gold_to])
            print("%6d%30s | %-30s%d" % (test_from,test_chunk, gold_chunk, gold_from))
    precision = 1.0 * true_positives / (true_positives + false_positives)
    recall = 1.0 * true_positives / (true_positives+ false_negatives)
    fscore = 2.0 * precision * recall / (precision+ recall)
    print()
    print("Test size: %5d tokens" % len(test_tokens))
    print("Gold size: %5d tokens" % len(gold_tokens))
    print("Nr errors: %5d chunks" % error_chunks)
    print("Precision: %5.2f %%" % (100 * precision))
    print("Recall: %5.2f %%" % (100 * recall))
    print("F-score: %5.2f %%" % (100 * fscore))
    print()
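
To see how the scores in evaluate_t1() come about, the same counting can be run on two short token lists. This is a sketch with made-up data (the lists `test_tokens` and `gold_tokens` below are hypothetical); it only mirrors the bookkeeping over `difflib` opcodes and leaves out the printed error table.

```python
import difflib

# Hypothetical tokenizer output and gold standard for a tiny text
test_tokens = ["isn", "'t", "it", "are", "a."]
gold_tokens = ["is", "n't", "it", "area", "."]

matcher = difflib.SequenceMatcher(None, test_tokens, gold_tokens)
true_pos = false_pos = false_neg = 0
for tag, t_from, t_to, g_from, g_to in matcher.get_opcodes():
    if tag == "equal":
        # tokens that are identical in both sequences
        true_pos += t_to - t_from
    else:
        # produced tokens with no match in gold / gold tokens that were missed
        false_pos += t_to - t_from
        false_neg += g_to - g_from

precision = true_pos / float(true_pos + false_pos)
recall = true_pos / float(true_pos + false_neg)
print("precision %.2f, recall %.2f" % (precision, recall))
```

Only "it" is shared between the two lists, so one token counts as a true positive and the remaining four on each side count as false positives and false negatives.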

In [193]:
# HOMEWORK 1. PART 1.
# Dummy function
# Feel free to make it more verbose and include prints/status updates
def hw1_part1():
    # Get the corpus
    corpus = get_corpus_t1()
    # Get the gold standard
    gold_tokens = get_gold_tokens()

    # Initial regular expression grammar
    # You need to modify it so that you can improve the performance of the tokenizer
    re_grammar = r'''(?x) # set flag to allow verbose regexps
    \'[a-z]+              # 's, 've, 're
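    |has                  # auxiliary stems, so that a following n't is split off; they also match word-initially ("area" -> "are" + "a.")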
    |should
    |would
    |have
    |did
    |are
    |was
    |could
    |does
    |were
    |had
    |n\'t
    |[A-Z]\'[A-Z][a-z]+   # O'Connor, C'Mon, ...
    |\w+/\w+              # fractions and words separated by /, e.g. 3/8, left/right
    |[0-9]+(?:,[0-9]+)+   # numbers with comma-separated digit groups, e.g. 11,390,000
    |[A-Z]+&[A-Z]+        # S&P, AT&T
    |[0-9]+\.[0-9]+       # plain decimal numbers, e.g. 9.7
    |[A-Z][a-z]{1,3}\.    # abbreviations, e.g. Nov., Mr.
    |(?:[A-Za-z]\.)+      # abbreviations, e.g. U.S.A., p.m., v.
    | \w+(?:-\w+)*        # words with optional internal hyphens
    | \.\.\.              # ellipsis
    | [][.,;"'?():-_`%&#{}!$]    # these are separate tokens; includes ], [ (":-_" is parsed as a range from ":" to "_")
    |--                   # double hyphen
    '''
    
    # Modify the corpus prior to tokenization here, if necessary
    
    # Tokenize the corpus
    test_tokens = regexp_tokenize(corpus, re_grammar)
    
    # Evaluate the results
    evaluate_t1(test_tokens,gold_tokens)
    
    return test_tokens

In [202]:
# HOMEWORK 1. PART 2.
# Dummy function
# Feel free to make it more verbose and include prints/status updates
def hw1_part2(tokens):
    # 1. The number of tokens in the corpus
    print("1) number of tokens: %d" % len(tokens))

    # 2. The number of types in the corpus (case insensitive!)
    lowercase_tokens = [el.lower() for el in tokens]
    print("2) number of unique tokens: %d" % len(set(lowercase_tokens)))

    # 3. The number of hapaxes - types that appear in the corpus only once (case insensitive!)
    print("3) number of hapaxes: %d" % len(FreqDist(lowercase_tokens).hapaxes()))

    # 4. The most frequent types with length >= 5
    tokens_5 = [t for t in tokens if len(t) >= 5]
    print("4) most frequent types with length >= 5: %s" % FreqDist(tokens_5).most_common(3))
    
    # 5. The average token length
    token_lengths = [len(el) for el in tokens]
    print("5) average token length: %s" % (np.sum(token_lengths) / (1.0 * len(tokens))))

    # 6. The most frequent token length in the corpus
    print("6) most frequent token length: %d" % FreqDist(token_lengths).max())
    
    # 7. The number of bi-, tri-, and five-grams in the corpus
    bigr_list = list(bigrams(tokens))
    trgr_list = list(trigrams(tokens))
    # a five-gram is the bigram at position i followed by the trigram at position i + 2
    fivegram_list = [bi + tri for bi, tri in zip(bigr_list[:-2], trgr_list[2:])]
    print("7) number of bi-, tri- and five-grams: %d, %d, %d" % (len(bigr_list), len(trgr_list), len(fivegram_list)))
    
    # 8. The most frequent bi- and tri-grams that do NOT contain punctuation (for the task, assume punctuation to be , . ! ? )
    punctuation_filter = lambda x: not("," in x or "." in x or "!" in x or "?" in x)
    print("8.a) most frequent bi-grams without punctuation:\n")
    print(FreqDist(filter(punctuation_filter, bigr_list)).most_common(5))
    print("8.b) most frequent tri-grams without punctuation:\n")
    print(FreqDist(filter(punctuation_filter, trgr_list)).most_common(5))

    # 9. The most frequent five-grams
    print("9) most frequent five-grams:\n")
    print(FreqDist(fivegram_list).most_common(5))
    
    # 10. The percentage of bi-, tri-, and five-grams that appear only once
    def percentage_1_time_grams(grams_list):
        """
        Computes the number of n-grams that appear only once,
        as a percentage of all n-gram occurrences in the list.
        """
        distr_grams = FreqDist(grams_list)
        count = 0.0
        for sample in distr_grams:
            if distr_grams[sample] == 1:
                count += 1
        return count / distr_grams.N() * 100

    print("10) percentage of bi-, tri, and five-grams that appear only once:")
    print("%.2f, %.2f, %.2f\n" % (percentage_1_time_grams(bigr_list),
                                  percentage_1_time_grams(trgr_list),
                                  percentage_1_time_grams(fivegram_list)))
    
    # 11. The 10 most frequent collocates of "man" and "woman" in the corpus, within a window of 4
    # a better approach using nltk methods is shown below
    def n_frequent_collocates_of(tokens, word, n, window_size):
        """
        Finds the top 'n' collocates of 'word' within a window of size 'window_size'.
        """
        cooc = BigramCollocationFinder.from_words(tokens, window_size)
        pairs_list = []
        for pair, freq in cooc.ngram_fd.items():
            if word in pair:
                pairs_list.append((pair, freq))
        pairs_list.sort(key=lambda x: -x[1]) # sort the list on descending frequency

        return pairs_list[:n]

    # window_size is the number of tokens spanned by a pair,
    # so 5 covers a word and the 4 tokens that follow it
    print("11.a) top 10 most frequent collocates of \"man\":")
    for pair, freq in n_frequent_collocates_of(tokens, word="man", n=10, window_size=5):
        print("%s %d" % (pair, freq))
    print("11.b) top 10 most frequent collocates of \"woman\":")
    for pair, freq in n_frequent_collocates_of(tokens, word="woman", n=10, window_size=5):
        print("%s %d" % (pair, freq))
    
    print("---------alternative way of doing 11)--------------")

    def collocates_of(word, tokens, window_size):
        """
        Finds collocates of 'word' in a window of size 'window_size'.
        """
        finder = BigramCollocationFinder.from_words(tokens, window_size)
        # the filter removes every pair for which it returns True,
        # i.e. every pair that does not contain 'word'
        word_filter = lambda w1, w2: word not in (w1, w2)
        finder.apply_ngram_filter(word_filter)
        return finder

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    man_finder = collocates_of(word='man', tokens=tokens, window_size=5)
    woman_finder = collocates_of(word='woman', tokens=tokens, window_size=5)

    print(man_finder.nbest(bigram_measures.raw_freq, 10))
    print(woman_finder.nbest(bigram_measures.raw_freq, 10))
        
    # 12. The 10 most frequent collocates of "man" and "woman", with a frequency of 5 or more, according to the PPMI score (within a window of 4)
    # nbest() ranks by PMI; for pairs with a positive PMI this is the same ordering as by PPMI
    print("12) top 10 collocates of \"man\" and \"woman\" with a frequency of 5 or more, according to the PMI score")
    # reuse the finders from point 11 and keep only pairs seen at least 5 times
    man_finder.apply_freq_filter(5)
    print(man_finder.nbest(bigram_measures.pmi, 10))
    woman_finder.apply_freq_filter(5)
    print(woman_finder.nbest(bigram_measures.pmi, 10))
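
For comparison with the NLTK finders used in points 11 and 12, the window counting and the PPMI score can also be written out by hand. This is a rough sketch under stated assumptions, not what NLTK computes internally: `window_cooccurrences` and `ppmi` are hypothetical helpers, the window is taken as 4 positions on either side of the target word, and probabilities are estimated simply as counts divided by the total number of tokens.

```python
import math
from collections import Counter

def window_cooccurrences(tokens, target, window=4):
    """Count the words occurring within `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts

def ppmi(pair_count, target_count, collocate_count, total):
    """Positive PMI (base 2) from raw counts; negative PMI is clipped to 0."""
    if pair_count == 0:
        return 0.0
    ratio = float(pair_count) * total / (target_count * collocate_count)
    return max(0.0, math.log(ratio, 2))

toy_tokens = "the old man saw the young man and the woman saw the man".split()
man_counts = window_cooccurrences(toy_tokens, "man", window=4)
print(man_counts.most_common(3))
```

A frequency threshold such as the "5 or more" in point 12 would be applied to the pair counts before ranking by score, since PMI tends to favour rare pairs.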
    

In [203]:
# Main program
tokens = hw1_part1()
hw1_part2(tokens)

 Token                         Error | Correct                       Token
--------------------------------------+--------------------------------------
   724                        are a. | area .                        724
   755                         Vose. | Vose .                        755
   815                  IBC/Donoghue | IBC                           816
  1297                          S.p. | S.p .                         1298
  1441                        isn 't | is n't                        1443
  1495                         Corp. | Corp .                        1497
  1640                        isn 't | is n't                        1643
  2752                         Conn. | Conn .                        2755
  2912                         . . . | ...                           2916
  3325                         . . . | ...                           3327
  3982                         Cray. | Cray .                        3982
  4352                        don 't

5) average token length: 4.47007040384
6) most frequent token length: 3
7) number of bi-, tri- and five-grams: 94170, 94169, 94167
8.a) most frequent bi-grams without punctuation:

[((u'of', u'the'), 486), ((u'in', u'the'), 363), ((u'for', u'the'), 171), ((u'to', u'the'), 170), ((u'on', u'the'), 149)]
8.b) most frequent tri-grams without punctuation:

[((u'the', u'company', u"'s"), 31), ((u'cents', u'a', u'share'), 31), ((u'in', u'the', u'U.S.'), 28), ((u'New', u'York', u'Stock'), 26), ((u'York', u'Stock', u'Exchange'), 26)]
9) most frequent five-grams:

[((u'the', u'New', u'York', u'Stock', u'Exchange'), 14), ((u',', u'the', u'company', u'said', u'.'), 12), ((u',', u'"', u'he', u'said', u'.'), 11), ((u'president', u'and', u'chief', u'executive', u'officer'), 11), ((u',', u'"', u'he', u'says', u'.'), 10)]
10) percentage of bi-, tri, and five-grams that appear only once:
46.83, 81.50, 97.55

11.a) top 10 most frequent collocates of "man":
(u',', u'man') 4
(u'.', u'man') 3
(u'man', u'.') 