Introduction to NLP course (2017-2018).

Homework 2.2: Chunking. Parsing with a context-free grammar

Objectives

1) Define, train and evaluate uni-gram and bi-gram HMM chunkers
- load the conll2000 corpus
- split the corpus into train and test sets
- define a class for a unigram chunker
- define a class for a bi-gram chunker. The bi-gram chunker should back off to the unigram chunker.
- train a unigram and a bi-gram chunker on the train corpus.
- evaluate and compare both chunkers on the test corpus
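
The backoff idea in the steps above can be sketched without NLTK. This is a minimal illustration, not the NLTK implementation: the helper names (`train_unigram`, `train_bigram`, `tag_with_backoff`) and the tiny training set are made up for the example. The bigram model conditions on the previous IOB tag and the current POS tag, and falls back to the unigram model when that context was never seen in training.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each POS tag to its most frequent IOB tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for pos, iob in sent:
            counts[pos][iob] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def train_bigram(tagged_sents):
    """Map (previous IOB tag, current POS tag) to the most frequent IOB tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # no previous tag at the start of a sentence
        for pos, iob in sent:
            counts[(prev, pos)][iob] += 1
            prev = iob
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_with_backoff(pos_tags, bigram, unigram):
    """Tag with the bigram model; back off to the unigram model on unseen contexts."""
    result, prev = [], None
    for pos in pos_tags:
        iob = bigram.get((prev, pos))
        if iob is None:
            iob = unigram.get(pos)  # backoff step
        result.append(iob)
        prev = iob
    return result

# Hypothetical toy training data: sentences of (POS tag, IOB tag) pairs
toy_train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBD", "O")],
]
unigram = train_unigram(toy_train)
bigram = train_bigram(toy_train)

print(tag_with_backoff(["NN", "VBD", "NN"], bigram, unigram))
# → ['B-NP', 'O', 'I-NP']
```

The sentence-initial NN is tagged B-NP by the bigram model, whereas the unigram model alone would choose I-NP. The final NN follows an O tag, a context absent from the toy data, so its tag comes from the unigram backoff.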

2) Create and use a simple context-free grammar for syntactic parsing
- extend the CFG given in the lectures
- load the grammar in an nltk.RecursiveDescentParser
- use the parser to parse the given corpus
- for each sentence, print the number of possible parses (correct answers below)

Correct number of parses for each sentence:
- “a young woman walks in the park” <- 1 parse
- “two young men smile” <- 1 parse
- “a young woman sees two men” <- 1 parse
- “sees two men a young woman” <- 0 parses
- “a young woman sees two old men in the park with a telescope” <- AT LEAST 3 parses
- “a young woman two old men in the park with a telescope sees” <- 0 parses
- “two angry men chase a woman with a telescope” <- 2 parses
- “a woman I know owns a telescope” <- 1 parse
- “a woman I know a telescope” <- 0 parses
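
As a cross-check on the counts above, the number of parse trees can also be computed without enumerating them. The sketch below is not part of the assignment and does not use NLTK: it re-encodes the grammar from exercise 2 as plain Python dictionaries (`RULES`, `LEXICON`) and counts derivations with a memoized recursion over spans. The function name `count_parses` is chosen for this example only.

```python
from functools import lru_cache

# Phrase-structure rules: nonterminal -> list of right-hand sides
RULES = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("V", "NP", "PP"), ("V", "PP"), ("V",)],
    "PP": [("P", "NP")],
    "NP": [("Det", "AN"), ("AN",)],
    "AN": [("A", "AN"), ("N", "PP"), ("N", "REL"), ("N",)],
    "REL": [("N", "V")],
}
# Lexical rules: preterminal -> set of words
LEXICON = {
    "V": {"saw", "ate", "walked", "walks", "smile", "sees", "chase", "know", "owns"},
    "A": {"two", "young", "old", "angry"},
    "Det": {"a", "an", "the", "my"},
    "N": {"man", "dog", "cat", "telescope", "park", "woman", "men",
          "John", "Mary", "Bob", "I"},
    "P": {"in", "on", "by", "with"},
}

def count_parses(words, start="S"):
    """Count the distinct parse trees of `words` rooted in `start`."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # Number of ways `symbol` derives words[i:j]
        if symbol in LEXICON:
            return int(j - i == 1 and words[i] in LEXICON[symbol])
        return sum(count_seq(rhs, i, j) for rhs in RULES[symbol])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        # Number of ways the symbol sequence `rhs` derives words[i:j]
        if len(rhs) == 1:
            return count(rhs[0], i, j)
        total = 0
        # The first symbol covers words[i:k]; each remaining symbol needs >= 1 word
        for k in range(i + 1, j - len(rhs) + 2):
            left = count(rhs[0], i, k)
            if left:
                total += left * count_seq(rhs[1:], k, j)
        return total

    return count(start, 0, len(words))

print(count_parses("two angry men chase a woman with a telescope".split()))  # → 2
print(count_parses("sees two men a young woman".split()))                    # → 0
```

The two parses of the first sentence correspond to the two attachments of "with a telescope": inside the object noun phrase (VP -> V NP) or directly to the verb phrase (VP -> V NP PP).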

In [1]:
# Import section
import nltk
from nltk.corpus import conll2000

In [2]:
# Class for the unigram chunker
# Trained on a corpus in POS-tagged, IOB-chunked format
# Parses POS-tagged sentences with the parse function
# Given in class
class unigram_chunker(nltk.ChunkParserI):
    
    # Initialize and train the chunker
    def __init__(self, train_sents):
        # Take the pos and the iob tags of the corpus
        # Ignore the actual words, we map from pos tag to iob tag
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        # Train a unigram tagger on the training data
        self.tagger = nltk.UnigramTagger(train_data)
    
    # Parse function
    # Takes one POS-tagged sentence: a list of (word, pos) pairs
    def parse(self, sentence):
        # Take the POS tags
        pos_tags = [pos for (word, pos) in sentence]
        # Use the tagger to assign an IOB chunk tag to each POS tag
        tagged_pos_tags = self.tagger.tag(pos_tags)
        # Take the chunk tags from the tagger output
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        # Recombine words, POS tags and chunk tags into CoNLL triples
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag) in zip(sentence, chunktags)]

        # Return the chunked sentence as a tree
        return nltk.chunk.conlltags2tree(conlltags)
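
To make the last step of `parse` more concrete, here is a simplified, NLTK-free sketch of how IOB tags are grouped into chunks. It only approximates what `nltk.chunk.conlltags2tree` does: it returns a flat list of `(chunk_type, words)` pairs instead of a tree, and it assumes every chunk tag is a string (`O`, `B-X` or `I-X`). The function name `iob_to_chunks` and the hand-tagged example sentence are illustrative only.

```python
def iob_to_chunks(conlltags):
    """Group (word, pos, iob) triples into a list of (chunk_type, [words])."""
    chunks = []
    current = None  # the chunk being built, or None when outside a chunk
    for word, pos, iob in conlltags:
        chunk_type = iob[2:] if iob[:2] in ("B-", "I-") else None
        if chunk_type is None:
            # "O" tag: the word is outside any chunk
            current = None
        elif iob.startswith("B-") or current is None or current[0] != chunk_type:
            # "B-" always opens a chunk; a stray "I-" is treated like "B-"
            current = (chunk_type, [word])
            chunks.append(current)
        else:
            # "I-" of the same type continues the open chunk
            current[1].append(word)
    return chunks

# Hand-tagged example (an assumption for illustration, not corpus data)
example = [
    ("a", "DT", "B-NP"), ("young", "JJ", "I-NP"), ("woman", "NN", "I-NP"),
    ("walks", "VBZ", "B-VP"), ("in", "IN", "B-PP"),
    ("the", "DT", "B-NP"), ("park", "NN", "I-NP"),
]
print(iob_to_chunks(example))
# → [('NP', ['a', 'young', 'woman']), ('VP', ['walks']), ('PP', ['in']), ('NP', ['the', 'park'])]
```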

In [3]:
# Class for the bigram chunker
# Trained on a corpus in POS-tagged, IOB-chunked format
# Parses POS-tagged sentences with the parse function
class bigram_chunker(nltk.ChunkParserI):
    
    # Initialize and train the chunker
    def __init__(self, train_sents, backoff_tagger):
        # Take the pos and the iob tags of the corpus
        # Ignore the actual words, we map from pos tag to iob tag
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        # Train a bigram tagger on the training data, backing off to the given tagger
        self.tagger = nltk.BigramTagger(train_data, backoff=backoff_tagger)
        
    # Parse function
    # Takes one POS-tagged sentence: a list of (word, pos) pairs
    def parse(self, sentence):
        # Take the POS tags
        pos_tags = [pos for (word, pos) in sentence]
        # Use the tagger to assign an IOB chunk tag to each POS tag
        tagged_pos_tags = self.tagger.tag(pos_tags)
        # Take the chunk tags from the tagger output
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        # Recombine words, POS tags and chunk tags into CoNLL triples
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag) in zip(sentence, chunktags)]

        # Return the chunked sentence as a tree
        return nltk.chunk.conlltags2tree(conlltags)

In [4]:
# Driver function for exercise 1
def hw22_ex1():
    # Load the predefined train/test split of the conll2000 corpus
    train = conll2000.chunked_sents("train.txt")
    test = conll2000.chunked_sents("test.txt")

    # Train the two chunkers; the bigram chunker backs off to the unigram tagger
    u_chunker = unigram_chunker(train)
    b_chunker = bigram_chunker(train, u_chunker.tagger)

    # Evaluate on the test corpus and print the results
    print(u_chunker.evaluate(test))
    print(b_chunker.evaluate(test))
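
The F-Measure lines in the evaluation output at the end of this notebook are consistent with the balanced F-measure, the harmonic mean of precision and recall. The check below recomputes it from the rounded precision and recall values reported there, so agreement is only expected to one decimal place. The helper name `f_measure` is chosen for this example and is not an NLTK function.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as printed for the two chunkers (in percent)
print(round(f_measure(74.3, 86.4), 1))  # unigram chunker → 79.9
print(round(f_measure(81.1, 86.4), 1))  # bigram chunker  → 83.7
```

Recall is the same for both chunkers in that output, so the gain in F-measure for the bigram chunker comes entirely from its higher precision.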

In [5]:
# Dummy function for exercise 2
def hw22_ex2():
    """Function for exercise 2"""
    # Corpus (given)
    corpus = [
        ['a', 'young', 'woman', 'walks', 'in', 'the', 'park'],
        ['two', 'young', 'men', 'smile'],
        ['a', 'young', 'woman', 'sees', 'two', 'men'],
        ['sees', 'two', 'men', 'a', 'young', 'woman'],
        ['a', 'young', 'woman', 'sees', 'two', 'old', 'men', 'in', 'the', 'park', 'with', 'a', 'telescope'],
        ['a', 'young', 'woman', 'two', 'old', 'men', 'in', 'the', 'park', 'with', 'a', 'telescope', 'sees'],
        ['two', 'angry', 'men', 'chase', 'a', 'woman', 'with', 'a', 'telescope'],
        ['a', 'woman', 'I', 'know', 'owns', 'a', 'telescope'],
        ['a', 'woman', 'I', 'know', 'a', 'telescope'],
    ]
    
    # Grammar (in a string format)
    grammar_string = """
    S -> NP VP 
    VP -> V NP | V NP PP | V PP | V
    PP -> P NP
    V -> "saw" | "ate" | "walked" | "walks" | "smile" | "sees" | "chase" | "know" | "owns"
    NP -> Det AN | AN
    AN -> A AN | N PP | N REL | N
    REL -> N V
    A -> "two" | "young" | "old" | "angry"
    Det -> "a" | "an" | "the" | "my"
    N -> "man" | "dog" | "cat" | "telescope" | "park" | "woman" | "men" | "John" | "Mary" | "Bob" | "I"
    P -> "in" | "on" | "by" | "with"
    """
    
    # Grammar (in nltk CFG format)
    grammar = nltk.CFG.fromstring(grammar_string)
    
    # Parse the corpus, count the parses for each sentence,
    # and print each sentence with its number of parses
    parser = nltk.RecursiveDescentParser(grammar, trace=0)

    print("\n")
    for sentence in corpus:
        # parser.parse() yields one tree per parse; count them without storing them
        num_parses = sum(1 for _ in parser.parse(sentence))
        print(" ".join(sentence) + ".  Parses: " + str(num_parses))

In [6]:
def main():
    print("\n------------------------------------------------------------------------")
    print("Exercise 1: unigram and bigram chunker")
    hw22_ex1()
    print("------------------------------------------------------------------------")
    print("\n------------------------------------------------------------------------")
    print("Exercise 2: number of parses with a CFG")
    hw22_ex2()

In [7]:
# Run the main function
if __name__ == "__main__":
    main()


------------------------------------------------------------------------
Exercise 1: unigram and bigram chunker
ChunkParse score:
    IOB Accuracy:  86.5%%
    Precision:     74.3%%
    Recall:        86.4%%
    F-Measure:     79.9%%
ChunkParse score:
    IOB Accuracy:  89.5%%
    Precision:     81.1%%
    Recall:        86.4%%
    F-Measure:     83.7%%
------------------------------------------------------------------------

------------------------------------------------------------------------
Exercise 2: number of parses with a CFG


a young woman walks in the park.  Parses: 1
two young men smile.  Parses: 1
a young woman sees two men.  Parses: 1
sees two men a young woman.  Parses: 0
a young woman sees two old men in the park with a telescope.  Parses: 3
a young woman two old men in the park with a telescope sees.  Parses: 0
two angry men chase a woman with a telescope.  Parses: 2
a woman I know owns a telescope.  Parses: 1
a woman I know a telescope.  Parses: 0
