Problem Set 4: Sequence labeling
=====================

This project focuses on sequence labeling, in the target domain of Twitter part-of-speech tagging.
Part (b) focuses on *discriminative* approaches, mainly averaged perceptron and structured perceptron.


### Submission guidelines ###

Here are the guidelines for submitting this problem set on T-Square. Please follow them, as it makes grading simpler.

* Submit these three things on T-Square:

   * A compressed gtnlplib folder containing all your code. Please don't attach the Python files separately on T-Square.

   * pset4.ipynb, presenting all your written answers and results.

   * The response files generated throughout the assignment: four from your regular models on the dev data and one from the bakeoff on the test data. Use the createSubmission.sh script to compress these files, and submit the generated response_files.tar on T-Square.


   * For the 'Error analysis' part, write your answers in the notebook only. If you want to refer to code or functions that you have written separately, give the location of that code in the notebook file.

* Please don't modify any of the relative paths to the data. You can copy the 'data' folder to the relative path given in 'gtnlplib/constants.py' while working through the assignment.

In [1]:
import operator
from collections import defaultdict, Counter
import matplotlib.pyplot as plt
%pylab --no-import-all inline

import gtnlplib.preproc
import gtnlplib.viterbi
import gtnlplib.clf_base
import gtnlplib.scorer
import gtnlplib.constants
import gtnlplib.features
import gtnlplib.tagger_base
import gtnlplib.avg_perceptron
import gtnlplib.str_perceptron

Populating the interactive namespace from numpy and matplotlib


In [2]:
import importlib
importlib.reload(gtnlplib.preproc)

<module 'gtnlplib.preproc' from 'gtnlplib/preproc.pyc'>

In [3]:
## Define the file names
trainfile = gtnlplib.constants.TRAIN_FILE
devfile = gtnlplib.constants.DEV_FILE
testfile = gtnlplib.constants.TEST_FILE # You do not have this for now
offset = gtnlplib.constants.OFFSET

In [4]:
# for convenience
tr_all = []
for i,(words,tags) in enumerate(gtnlplib.preproc.conllSeqGenerator(trainfile)):
    tr_all.append((words,tags))

In [5]:
## Demo
alltags = set()
for i,(words, tags) in enumerate(gtnlplib.preproc.conllSeqGenerator(trainfile)):    
    for tag in tags:
        alltags.add(tag)
print(alltags)


set(['!', '#', '$', '&', ',', 'A', '@', 'E', 'D', 'G', 'M', 'L', 'O', 'N', 'P', 'S', 'R', 'U', 'T', 'V', 'Y', 'X', 'Z', '^', '~'])


# 1. Classification-based tagging #

First, you will treat tagging as a classification problem.

Recall that in structured prediction, the feature function decomposes across positions in the sequence:

\begin{align}
\renewcommand{\vec}[1]{\mathbf{#1}}
\vec{f}(\vec{w},\vec{y}) & = \sum_m \vec{f}(\vec{w},y_m, y_{m-1}, m)
\end{align}

You will explicitly define your feature functions in this way -- even for the classification-based tagger, which won't consider $y_{m-1}$. The features themselves are defined as tuples, as in pset 3.

Here is a simple example:

In [6]:
def wordFeatures(words,tag,prev_tag,m):
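    '''
    Offset and emission features for position m.

    :param words: a list of words
    :type words: list
    :param tag: a tag
    :type tag: string
    :param prev_tag: the previous tag (unused by this feature function)
    :type prev_tag: string
    :param m: index of the current word
    :type m: int
    '''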
    out = {(offset,tag):1}
    if m < len(words): #we can have m = M, for the transition to the end state
        out[(gtnlplib.constants.EMIT,tag,words[m])]=1
    return out

In [7]:
sent = 'they can can fish'.split()

In [8]:
wordFeatures(sent,'V','V',0)

{('**OFFSET**', 'V'): 1, ('--EMISSION--', 'V', 'they'): 1}

**Deliverable 1a** (1 point) Complete the feature function 'wordCharFeatures' in gtnlplib/features.py, which includes the final character of the current word and the final character of the preceding word (if $m > 0$), along with the features above. The names for these features are defined in gtnlplib.constants.
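As a rough sketch of what such a feature function could look like (not the reference solution), the block below adds the two suffix features to the offset and emission features. The constant names `CURR_SUFFIX` and `PREV_SUFFIX` are hypothetical; only their string values are copied from the expected output shown in this notebook. Skipping the suffix features at the end state ($m = M$) is also an assumption.

```python
OFFSET = '**OFFSET**'            # string values copied from the outputs in this notebook
EMIT = '--EMISSION--'
CURR_SUFFIX = '--curr-suff--'    # hypothetical names for the constants
PREV_SUFFIX = '--prev-suff--'

def sketch_word_char_features(words, tag, prev_tag, m):
    """Offset, emission, and last-character features for position m."""
    feats = {(OFFSET, tag): 1}
    if m < len(words):  # m == len(words) is the transition to the end state
        feats[(EMIT, tag, words[m])] = 1
        feats[(CURR_SUFFIX, tag, words[m][-1])] = 1
        if m > 0:  # the first word has no preceding word
            feats[(PREV_SUFFIX, tag, words[m - 1][-1])] = 1
    return feats

toy_sent = 'they can can fish'.split()
feats_at_1 = sketch_word_char_features(toy_sent, 'V', 'V', 1)
feats_at_0 = sketch_word_char_features(toy_sent, 'V', 'V', 0)
print(feats_at_1)
print(feats_at_0)
```

At position 0 there is no preceding word, so only three features come back.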

In [9]:
import importlib
importlib.reload(gtnlplib.features)

<module 'gtnlplib.features' from 'gtnlplib/features.pyc'>

In [10]:
# sanity check desired output
print(gtnlplib.features.wordCharFeatures(sent,'V','V',1))
# no prev-suff feature in this one, because m=0
print(gtnlplib.features.wordCharFeatures(sent,'V','V',0))

{('--curr-suff--', 'V', 'n'): 1, ('--EMISSION--', 'V', 'can'): 1, ('**OFFSET**', 'V'): 1, ('--prev-suff--', 'V', 'y'): 1}
{('--curr-suff--', 'V', 'y'): 1, ('**OFFSET**', 'V'): 1, ('--EMISSION--', 'V', 'they'): 1}


Now you will define a classification-based tagger. To get you started, here are some test weights.

In [11]:
test_weights = defaultdict(float)
test_tags = ['N','V','V','N']
for i in range(len(sent)):
    for feat in wordFeatures(sent,test_tags[i],'X',i):
        test_weights[feat] = 1
    for feat in wordFeatures(sent,'X','X',i):
        test_weights[feat] = 1
print(test_weights)

defaultdict(<type 'float'>, {('--EMISSION--', 'X', 'fish'): 1, ('--EMISSION--', 'N', 'fish'): 1, ('--EMISSION--', 'X', 'they'): 1, ('--EMISSION--', 'V', 'can'): 1, ('**OFFSET**', 'V'): 1, ('--EMISSION--', 'N', 'they'): 1, ('**OFFSET**', 'N'): 1, ('--EMISSION--', 'X', 'can'): 1, ('**OFFSET**', 'X'): 1})


In [12]:
# use this to find the highest-scoring label
argmax = lambda x : max(x.items(),key=operator.itemgetter(1))[0]

**Deliverable 1b** (1 point): Complete the function classifierTagger in gtnlplib/tagger_base.py that takes a list of words, feature function, dict of weights, and a tagset, and outputs a list of predicted tags (one per word).

You should use featfunc to get the features and return the list of tags with highest score for each word.
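To illustrate the idea (this is a sketch, not the reference solution), the block below scores every tag at every position independently and keeps the best one. The names `toy_word_features` and `sketch_classifier_tagger` and the toy weights are made up for this example; only the feature layout follows wordFeatures above.

```python
OFFSET = '**OFFSET**'      # string values copied from the outputs in this notebook
EMIT = '--EMISSION--'

def toy_word_features(words, tag, prev_tag, m):
    """Same shape as wordFeatures above: an offset plus an emission feature."""
    feats = {(OFFSET, tag): 1}
    if m < len(words):
        feats[(EMIT, tag, words[m])] = 1
    return feats

def sketch_classifier_tagger(words, featfunc, weights, tagset):
    """Pick the highest-scoring tag for each word, ignoring the previous tag."""
    predicted = []
    for m in range(len(words)):
        scores = {}
        for tag in sorted(tagset):  # sorted only to make tie-breaking deterministic
            feats = featfunc(words, tag, None, m)
            scores[tag] = sum(weights.get(f, 0.0) * v for f, v in feats.items())
        predicted.append(max(scores, key=scores.get))
    return predicted

toy_weights = {
    (EMIT, 'N', 'they'): 1.0,
    (EMIT, 'V', 'can'): 1.0,
    (EMIT, 'N', 'fish'): 1.0,
}
toy_tags = sketch_classifier_tagger('they can can fish'.split(),
                                    toy_word_features, toy_weights, {'N', 'V'})
print(toy_tags)  # ['N', 'V', 'V', 'N']
```

Using `weights.get(f, 0.0)` rather than `weights[f]` avoids adding zero-valued entries to a defaultdict while scoring.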

In [13]:
gtnlplib.tagger_base.classifierTagger(sent,wordFeatures,test_weights,alltags)

['N', 'V', 'V', 'N']

In [14]:
confusion = gtnlplib.tagger_base.evalTagger(lambda words,alltags : gtnlplib.tagger_base.classifierTagger(words,wordFeatures,test_weights,alltags),'test')
print(gtnlplib.scorer.accuracy(confusion))

0.139539705577


**Deliverable 1c** (3 points): Apply your averaged perceptron from pset 2 to do part-of-speech tagging. Start by adapting your oneItAvgPerceptron function. You'll have to make some changes:

- Replace your call to the predict() function with a call to classifierTagger()
- The instanceGenerator now produces word lists and tag lists as instances, instead of feature counts.
- You can treat entire sentences as instances, if you want -- this may be slightly easier. This means that you only update the weights after seeing an entire sentence, sort of like a minibatch.
- You'll want to add the feature function as an extra argument to both oneItAvgPerceptron and trainAvgPerceptron
- Return the training accuracy rather than the number of errors

Complete the oneItAvgPerceptron function in gtnlplib/avg_perceptron.py for this part.
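The changes listed above can be sketched roughly as follows. This is a toy version under stated assumptions, not the reference solution: `toy_features` and `toy_tagger` are hypothetical stand-ins for your feature function and classifierTagger, and the way `wsum` is accumulated (each update weighted by the instance counter `t`) is one common convention that may differ from your pset 2 code.

```python
from collections import defaultdict

def toy_features(words, tag, prev_tag, m):
    # hypothetical feature function: one offset and one emission feature
    return {('offset', tag): 1, ('emit', tag, words[m]): 1}

def toy_tagger(words, featfunc, weights, tagset):
    # stand-in for classifierTagger: independent argmax at each position
    def score(tag, m):
        return sum(weights.get(f, 0.0) * v for f, v in featfunc(words, tag, None, m).items())
    return [max(sorted(tagset), key=lambda tag: score(tag, m)) for m in range(len(words))]

def sketch_one_pass(instances, featfunc, weights, wsum, tagset, t=0):
    """One pass over the data, treating each sentence as a single instance."""
    correct = total = 0
    for words, tags in instances:
        predicted = toy_tagger(words, featfunc, weights, tagset)  # predict first, then update
        for m, (gold, guess) in enumerate(zip(tags, predicted)):
            total += 1
            if gold == guess:
                correct += 1
                continue
            for f, v in featfunc(words, gold, None, m).items():
                weights[f] += v
                wsum[f] += t * v   # assumed convention: counter-weighted running sum
            for f, v in featfunc(words, guess, None, m).items():
                weights[f] -= v
                wsum[f] -= t * v
        t += 1
    return weights, wsum, correct / total, t

toy_data = [('they can fish'.split(), ['O', 'V', 'N'])]
w_toy, wsum_toy, t_toy = defaultdict(float), defaultdict(float), 0
accuracies = []
for _ in range(5):
    w_toy, wsum_toy, acc, t_toy = sketch_one_pass(toy_data, toy_features, w_toy,
                                                  wsum_toy, {'N', 'O', 'V'}, t_toy)
    accuracies.append(acc)
print(accuracies)
```

Under this convention, the averaged weight for a feature `f` after `t` instances would be `weights[f] - wsum[f] / t`. The training accuracy on the toy sentence climbs to 1.0 within a few passes.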

In [15]:
import importlib
importlib.reload(gtnlplib.avg_perceptron)

<module 'gtnlplib.avg_perceptron' from 'gtnlplib/avg_perceptron.pyc'>

In [16]:
weights,wsum,tr_acc,i = gtnlplib.avg_perceptron.oneItAvgPerceptron(tr_all,wordFeatures,defaultdict(float),defaultdict(float),alltags)

In [17]:
#sanity check. The weight sum numbers might be different if you don't treat sentences as instances, which is what I do.
print(weights[gtnlplib.constants.EMIT,'D','the'], wsum[gtnlplib.constants.EMIT,'D','the'])
print(weights[gtnlplib.constants.EMIT,'N','the'], wsum[gtnlplib.constants.EMIT,'N','the'])
print(weights[gtnlplib.constants.EMIT,'V','like'], wsum[gtnlplib.constants.EMIT,'V','like'])
print(weights[gtnlplib.constants.EMIT,'P','like'], wsum[gtnlplib.constants.EMIT,'P','like'])

16.0 2611.0
-1.0 -212.0
2.0 587.0
5.0 942.0


**Deliverable 1d** (2 points): Now adapt the trainAvgPerceptron function in gtnlplib/avg_perceptron.py to do tagging. This should require fewer changes than oneItAvgPerceptron, but you will have to:

- take a feature function as an argument
- call evalTagger instead of evalClassifier to get the confusion matrix
- don't forget you've modified oneItAvgPerceptron to return the training set accuracy, not the number of errors

In [18]:
w, tr_acc, dv_acc =  gtnlplib.avg_perceptron.trainAvgPerceptron(10,tr_all,gtnlplib.features.wordCharFeatures,alltags)

0 dev: 0.673439767779 train: 0.523428415076
1 dev: 0.710346257516 train: 0.685477802859
2 dev: 0.729836201534 train: 0.764279362473
3 dev: 0.736678415924 train: 0.824953827211
4 dev: 0.742691270993 train: 0.853615158356
5 dev: 0.746423387933 train: 0.879813940762
6 dev: 0.746423387933 train: 0.89034817703
7 dev: 0.746630727763 train: 0.899445926534
8 dev: 0.747045407423 train: 0.903892195089
9 dev: 0.747460087083 train: 0.904713044668


In [19]:
#You will get the test file later (48 hours before the deadline)
gtnlplib.tagger_base.evalTagger(lambda words,alltags : gtnlplib.tagger_base.classifierTagger(words,gtnlplib.features.wordCharFeatures,w,alltags),'avg_perceptron.response',testfile=devfile)

defaultdict(<type 'int'>, {('D', '^'): 2, ('#', 'R'): 1, (',', ','): 459, ('G', 'G'): 18, ('S', 'N'): 4, ('O', 'P'): 8, ('P', 'U'): 1, ('N', '@'): 6, ('&', 'R'): 1, ('N', 'R'): 8, ('G', 'P'): 1, ('V', '^'): 7, ('U', ','): 1, ('N', ','): 1, ('O', '^'): 2, ('A', 'N'): 58, ('N', 'A'): 4, ('A', 'P'): 1, ('D', 'G'): 1, ('^', ','): 1, ('V', 'N'): 67, ('O', 'O'): 304, ('~', '~'): 143, (',', '@'): 1, ('P', 'R'): 5, ('&', '&'): 87, ('D', 'L'): 3, ('E', ','): 3, ('$', 'U'): 10, ('A', '!'): 2, ('G', 'V'): 6, ('D', '!'): 3, ('^', '!'): 1, ('@', '^'): 14, ('A', '@'): 4, ('R', '^'): 2, ('L', 'V'): 5, ('A', 'R'): 11, ('L', 'D'): 2, ('#', '@'): 8, ('G', 'U'): 4, ('E', 'N'): 1, ('$', '@'): 5, ('P', 'A'): 3, ('$', '~'): 1, ('R', '&'): 1, ('N', '^'): 31, ('@', 'N'): 27, ('R', 'N'): 19, ('@', 'G'): 1, ('V', '!'): 1, ('$', 'A'): 1, ('G', '~'): 4, ('@', 'U'): 6, ('A', 'T'): 1, ('#', '^'): 5, ('$', 'P'): 5, ('^', 'D'): 2, ('#', 'U'): 3, ('T', 'N'): 1, ('E', 'V'): 2, ('$', 'N'): 3, ('P', 'V'): 4, ('G', 'O'): 

**Deliverable 1e** (3 points): Make it better! Design a killer feature set that improves performance on the devset.

I'm able to get above 84% on the dev set, without going too crazy. Warning: my additional features slow things down considerably.


Please complete the yourFeatures function in gtnlplib/features.py for this.
To pass the unit tests, you need to reach at least 81% accuracy.

In [20]:
import importlib
importlib.reload(gtnlplib.features)

<module 'gtnlplib.features' from 'gtnlplib/features.pyc'>

In [21]:
w, tr_acc, dv_acc = gtnlplib.avg_perceptron.trainAvgPerceptron(15,tr_all,gtnlplib.features.yourFeatures,alltags)

0 dev: 0.809869375907 train: 0.689103221835
1 dev: 0.827078581796 train: 0.85169984267
2 dev: 0.835994194485 train: 0.908064847117
3 dev: 0.840970350404 train: 0.933511184076
4 dev: 0.842421729214 train: 0.949654559135
5 dev: 0.844287787684 train: 0.957042205349
6 dev: 0.843665768194 train: 0.963403789589
7 dev: 0.843043748704 train: 0.971749093645
8 dev: 0.842836408874 train: 0.977768657227
9 dev: 0.844495127514 train: 0.982283329913
10 dev: 0.845324486834 train: 0.982625350571
11 dev: 0.845117147004 train: 0.982214925782
12 dev: 0.845117147004 train: 0.986592790205
13 dev: 0.844287787684 train: 0.989739380259
14 dev: 0.844702467344 train: 0.989397359601


In [23]:
gtnlplib.tagger_base.evalTagger(lambda words,alltags : gtnlplib.tagger_base.classifierTagger(words,gtnlplib.features.yourFeatures,w,alltags),'avg_perceptron_custom.response',testfile=devfile)

defaultdict(<type 'int'>, {('D', '^'): 6, (',', ','): 473, ('N', '#'): 4, ('G', 'G'): 22, ('S', 'N'): 2, ('O', 'P'): 4, ('N', '@'): 1, ('N', 'R'): 9, ('G', 'P'): 1, ('V', '^'): 14, ('U', ','): 1, ('O', '^'): 1, ('A', 'N'): 28, ('N', 'A'): 20, ('A', 'P'): 1, ('D', 'G'): 1, ('^', ','): 1, ('V', 'N'): 44, ('N', 'S'): 1, ('O', 'O'): 316, ('~', '~'): 157, ('G', 'E'): 3, ('P', 'R'): 4, ('&', '&'): 86, ('D', 'L'): 2, ('E', ','): 4, ('A', '!'): 3, ('G', 'V'): 5, ('D', '!'): 2, ('Z', '^'): 1, ('^', '!'): 4, ('U', '~'): 1, ('R', '^'): 3, ('L', 'V'): 1, ('A', 'R'): 4, ('L', 'D'): 3, ('$', 'R'): 2, ('^', 'Z'): 2, ('S', 'S'): 1, ('P', 'P'): 413, ('N', 'L'): 1, ('&', 'N'): 1, ('&', 'D'): 3, ('Z', 'N'): 3, ('O', '&'): 1, ('N', '^'): 34, ('V', '#'): 2, ('R', 'N'): 10, ('^', '#'): 5, ('A', 'T'): 1, ('#', '^'): 6, ('$', 'P'): 2, ('^', 'D'): 3, ('S', 'Z'): 2, ('D', 'X'): 1, ('$', 'N'): 6, ('P', 'V'): 1, ('G', 'O'): 1, ('P', 'O'): 1, ('N', 'N'): 522, ('P', 'D'): 4, ('^', 'E'): 1, ('G', '~'): 2, ('R', 'P')

# 2. Discriminative Structure Prediction #

Now you will implement a structured perceptron, which is trained to find the optimal *sequence* $\hat{\vec{y}} = \arg\max_{\vec{y}} \theta^{\top} \vec{f}(\vec{w},\vec{y})$.

A key difference from the classification-based setting is that we compute features over the entire sequence.

**Deliverable 2a** (1 point): Implement a function seqFeatures in gtnlplib/features.py, which takes a list of words, a list of tags, and a feature function, and returns a dictionary of features and their counts.
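One way to read the decomposition $\vec{f}(\vec{w},\vec{y}) = \sum_m \vec{f}(\vec{w},y_m,y_{m-1},m)$ in code is sketched below. It is an illustration under assumptions, not the reference solution: the start and end tag strings are copied from the outputs in this notebook, the sum is assumed to run through position $M$ (the transition to the end state), and `toy_word_features` is a local copy of wordFeatures.

```python
from collections import defaultdict

START_TAG, END_TAG = '--START--', '--END--'   # strings as they appear in the outputs
OFFSET, EMIT = '**OFFSET**', '--EMISSION--'

def toy_word_features(words, tag, prev_tag, m):
    feats = {(OFFSET, tag): 1}
    if m < len(words):
        feats[(EMIT, tag, words[m])] = 1
    return feats

def sketch_seq_features(words, tags, featfunc):
    """Sum the local feature dicts over positions 0..M, where M is the end transition."""
    counts = defaultdict(float)
    padded = [START_TAG] + list(tags) + [END_TAG]
    for m in range(len(words) + 1):
        # padded[m + 1] is the tag at position m; padded[m] is its predecessor
        for feat, value in featfunc(words, padded[m + 1], padded[m], m).items():
            counts[feat] += value
    return counts

seq_counts = sketch_seq_features('they can can fish'.split(),
                                 ['N', 'V', 'V', 'N'], toy_word_features)
print(dict(seq_counts))
```

Padding the tag list with the start and end tags keeps the loop body free of special cases for the first and last positions.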

In [24]:
import importlib
importlib.reload(gtnlplib.features)

<module 'gtnlplib.features' from 'gtnlplib/features.pyc'>

In [25]:
gtnlplib.features.seqFeatures(sent,['N','V','V','N'],wordFeatures)

defaultdict(<type 'float'>, {('--EMISSION--', 'N', 'fish'): 1.0, ('--EMISSION--', 'V', 'can'): 2.0, ('**OFFSET**', 'V'): 2.0, ('--EMISSION--', 'N', 'they'): 1.0, ('**OFFSET**', 'N'): 2.0, ('**OFFSET**', '--END--'): 1.0})

**Deliverable 2b** (1 point): Now complete the function wordTransFeatures in gtnlplib/features.py, which adds tag-to-tag transition features to wordFeatures. Note that this feature set is identical to what the HMM uses.

In [26]:
gtnlplib.features.seqFeatures(sent,['N','V','V','N'],gtnlplib.features.wordTransFeatures)

defaultdict(<type 'float'>, {('--TRANS--', 'N', '--START--'): 1.0, ('--TRANS--', '--END--', 'N'): 1.0, ('--EMISSION--', 'N', 'fish'): 1.0, ('--EMISSION--', 'V', 'can'): 2.0, ('**OFFSET**', 'V'): 2.0, ('--EMISSION--', 'N', 'they'): 1.0, ('--TRANS--', 'V', 'V'): 1.0, ('--TRANS--', 'N', 'V'): 1.0, ('**OFFSET**', 'N'): 2.0, ('--TRANS--', 'V', 'N'): 1.0, ('**OFFSET**', '--END--'): 1.0})

**Deliverable 2c** (1 point): Copy in your viterbiTagger from problem set 3. If you implemented it correctly, you should be able to use it without modification here.
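As a reminder of what the decoder has to do, here is a compact Viterbi sketch over a feature function with the same `(words, tag, prev_tag, m)` signature. It is only an illustration: `toy_trans_features`, `sketch_viterbi`, and the toy weights are made up, tie-breaking is arbitrary, and your pset 3 implementation may be organized quite differently.

```python
START_TAG, END_TAG = '--START--', '--END--'   # strings as they appear in the outputs
EMIT, TRANS = '--EMISSION--', '--TRANS--'

def toy_trans_features(words, tag, prev_tag, m):
    # hypothetical feature function: a transition feature plus an emission feature
    feats = {(TRANS, tag, prev_tag): 1}
    if m < len(words):
        feats[(EMIT, tag, words[m])] = 1
    return feats

def sketch_viterbi(words, featfunc, weights, tagset):
    """Return (best tag sequence, its score) for a non-empty list of words."""
    tags = sorted(tagset)
    M = len(words)

    def score(tag, prev_tag, m):
        return sum(weights.get(f, 0.0) * v for f, v in featfunc(words, tag, prev_tag, m).items())

    # trellis[m][tag] = (score of the best path ending in tag at position m, backpointer)
    trellis = [{tag: (score(tag, START_TAG, 0), None) for tag in tags}]
    for m in range(1, M):
        trellis.append({tag: max((trellis[m - 1][prev][0] + score(tag, prev, m), prev)
                                 for prev in tags)
                        for tag in tags})
    # close the sequence with the transition into the end state
    best_score, last = max((trellis[M - 1][prev][0] + score(END_TAG, prev, M), prev)
                           for prev in tags)
    path = [last]
    for m in range(M - 1, 0, -1):  # follow backpointers from the last word to the first
        path.append(trellis[m][path[-1]][1])
    path.reverse()
    return path, best_score

toy_theta = {
    (EMIT, 'N', 'they'): 1.0, (EMIT, 'N', 'can'): 1.0, (EMIT, 'V', 'can'): 1.0,
    (EMIT, 'N', 'fish'): 1.0, (TRANS, 'V', 'N'): 1.0, (TRANS, 'N', 'V'): 1.0,
}
best_path, best_total = sketch_viterbi('they can fish'.split(), toy_trans_features,
                                       toy_theta, {'N', 'V'})
print(best_path, best_total)  # ['N', 'V', 'N'] 5.0
```

In the toy weights, "can" is equally good as N or V on emissions alone, so the transition features decide the tag.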

In [27]:
import importlib
importlib.reload(gtnlplib.viterbi)

<module 'gtnlplib.viterbi' from 'gtnlplib/viterbi.pyc'>

In [28]:
gtnlplib.viterbi.viterbiTagger(['they','can','can','fish'],gtnlplib.features.wordTransFeatures,test_weights,alltags)

(['N', 'V', 'V', 'N'], 8.0)

**Deliverable 2d** (3 points): Complete the function oneItAvgStructPerceptron in gtnlplib/str_perceptron.py, which performs a single iteration of averaged structured perceptron. It should be similar to your oneItAvgPerceptron, but will have to be different in some ways to reflect the structured prediction scenario.

- To make predictions, you must call your viterbiTagger function
- To compute the features for a given sequence of words and tags, you must call your seqFeatures function
- As above, output the training accuracy, not the number of training errors
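The core update can be sketched as below: decode with the current weights, and if the prediction is wrong, add the features of the gold sequence and subtract those of the predicted one. This is a simplified illustration, not the reference solution. Averaging is left out, all helper names are hypothetical, and `brute_force_decode` is a stand-in for viterbiTagger that only works on tiny inputs.

```python
from collections import defaultdict
from itertools import product

START_TAG, END_TAG = '--START--', '--END--'   # strings as they appear in the outputs
EMIT, TRANS = '--EMISSION--', '--TRANS--'

def toy_local_features(words, tag, prev_tag, m):
    feats = {(TRANS, tag, prev_tag): 1}
    if m < len(words):
        feats[(EMIT, tag, words[m])] = 1
    return feats

def toy_seq_features(words, tags):
    # sum local features over positions 0..M (M is the end transition)
    counts = defaultdict(float)
    padded = [START_TAG] + list(tags) + [END_TAG]
    for m in range(len(words) + 1):
        for f, v in toy_local_features(words, padded[m + 1], padded[m], m).items():
            counts[f] += v
    return counts

def brute_force_decode(words, weights, tagset):
    """Stand-in for Viterbi: score every possible tag sequence."""
    def total(tags):
        return sum(weights.get(f, 0.0) * v for f, v in toy_seq_features(words, tags).items())
    return list(max(product(sorted(tagset), repeat=len(words)), key=total))

def sketch_struct_update(words, gold, weights, tagset):
    """One structured perceptron step: theta += f(w, gold) - f(w, predicted)."""
    pred = brute_force_decode(words, weights, tagset)
    if pred != list(gold):
        for f, v in toy_seq_features(words, gold).items():
            weights[f] += v
        for f, v in toy_seq_features(words, pred).items():
            weights[f] -= v
    return pred

theta_toy = defaultdict(float)
toy_words, toy_gold = 'they can fish'.split(), ['O', 'V', 'N']
history = [sketch_struct_update(toy_words, toy_gold, theta_toy, {'N', 'O', 'V'})
           for _ in range(3)]
print(history)
```

Features shared by the gold and predicted sequences cancel out, so only the positions where the two disagree change the weights.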

In [29]:
import importlib
importlib.reload(gtnlplib.str_perceptron)

<module 'gtnlplib.str_perceptron' from 'gtnlplib/str_perceptron.pyc'>

Speed is important here. Use this line to benchmark your code.
- My "optimized" implementation takes 1.1 seconds per iteration. 
- My "less optimized" implementation takes 1.6 seconds per iteration.

In [30]:
%%timeit
weights,wsum,tr_acc,i = gtnlplib.str_perceptron.oneItAvgStructPerceptron(tr_all[:100],
                                                                         gtnlplib.features.wordTransFeatures,
                                                                         defaultdict(float),
                                                                         defaultdict(float),
                                                                         alltags)
# careful, the %%timeit magic means that this block doesn't change the notebook state 

1 loops, best of 3: 1.11 s per loop


In [31]:
weights,wsum,tr_acc,i = gtnlplib.str_perceptron.oneItAvgStructPerceptron(tr_all[:100],gtnlplib.features.wordTransFeatures,defaultdict(float),defaultdict(float),alltags)

In [32]:
for tag1 in list(alltags)[:7]:
    for tag2 in list(alltags)[:7]:
        if weights[gtnlplib.constants.TRANS,tag1,tag2] != 0:
            print(tag1,tag2,weights[(gtnlplib.constants.TRANS,tag1,tag2)],wsum[gtnlplib.constants.TRANS,tag1,tag2])

! ! -29.0 18.0
! , 2.0 -49.0
! @ 5.0 -75.0
# # -3.0 -130.0
# , -3.0 -329.0
$ $ -14.0 -194.0
$ , 1.0 -164.0
& @ 2.0 127.0
, ! 3.0 -196.0
, # -4.0 -393.0
, $ 3.0 68.0
, & -3.0 -150.0
, , -1.0 -43.0
, A 3.0 158.0
, @ 1.0 43.0
A ! 2.0 108.0
A , -1.0 -36.0
A A -11.0 -400.0
A @ -1.0 -33.0
@ ! 1.0 95.0
@ & 1.0 45.0
@ @ -9.0 -412.0


**Deliverable 2e** (2 points): Implement trainAvgStructPerceptron in gtnlplib/str_perceptron.py. This will be quite similar to your trainAvgPerceptron from pset 2, but it will have to take slightly different arguments to handle the structured prediction case. Don't forget to use evalTagger to produce output.

In [35]:
# your code should roughly reproduce this sanity check. It may be a little slow, so we'll just test on the first 50 instances.
# While you're debugging your code, you can run on even smaller datasets.
theta,tr_acc,dv_acc = gtnlplib.str_perceptron.trainAvgStructPerceptron(5,tr_all[:50],gtnlplib.features.wordTransFeatures,alltags)

0 dev: 0.373833713456 train: 0.207874015748
1 dev: 0.428778768401 train: 0.363779527559
2 dev: 0.472527472527 train: 0.587401574803
3 dev: 0.494920174165 train: 0.749606299213
4 dev: 0.513580758864 train: 0.763779527559


In [36]:
theta,tr_acc,dv_acc = gtnlplib.str_perceptron.trainAvgStructPerceptron(10,tr_all,gtnlplib.features.wordTransFeatures,alltags)

0 dev: 0.66369479577 train: 0.488542307955
1 dev: 0.699979266017 train: 0.665366988166
2 dev: 0.717395811735 train: 0.742253232095
3 dev: 0.730250881194 train: 0.806484711677
4 dev: 0.741861911673 train: 0.849716122854
5 dev: 0.744764669293 train: 0.874820439155
6 dev: 0.748496786233 train: 0.893015938163
7 dev: 0.750155504872 train: 0.905055065326
8 dev: 0.752850922662 train: 0.914700047883
9 dev: 0.754924320962 train: 0.920445994938


In [37]:
confusion = gtnlplib.tagger_base.evalTagger(lambda words, alltags : gtnlplib.viterbi.viterbiTagger(words,gtnlplib.features.wordTransFeatures,theta,alltags)[0],'str_avg_perceptron.response',testfile=devfile)

**Deliverable 2f** (3 points): Implement a better feature set for structured prediction by completing the yourHMMFeatures function in gtnlplib/features.py. For speed reasons, you might not want to use all the features you used in Deliverable 1e, but try to get as good an accuracy as you can. Last year I was able to get my structured perceptron to work a little better than my best classifier, but this year my classifier is (very slightly) better!

In [34]:
import importlib
importlib.reload(gtnlplib.features)

<module 'gtnlplib.features' from 'gtnlplib/features.pyc'>

In [35]:
theta,tr_acc,dv_acc = gtnlplib.str_perceptron.trainAvgStructPerceptron(10,tr_all,gtnlplib.features.yourHMMFeatures,alltags)

0 dev: 0.754509641302 train: 0.658800191532
1 dev: 0.789135392909 train: 0.813598741364
2 dev: 0.822309765706 train: 0.88268691429
3 dev: 0.77814638192 train: 0.913126752856
4 dev: 0.799502384408 train: 0.935289691497
5 dev: 0.832676757205 train: 0.944866269923
6 dev: 0.833713456355 train: 0.956905397086
7 dev: 0.833920796185 train: 0.9655927218
8 dev: 0.831225378395 train: 0.974280046515
9 dev: 0.831847397885 train: 0.973322388672


In [None]:
from gtnlplib import scorer

In [36]:
confusion = gtnlplib.tagger_base.evalTagger(lambda words, alltags : gtnlplib.viterbi.viterbiTagger(words,gtnlplib.features.yourHMMFeatures,theta,alltags)[0],'str_avg_perceptron_custom.response',testfile=devfile)
print(scorer.accuracy(confusion))

0.841023489933


# 3. Error analysis #

(3 points; 7650 only) The scorer.py script produces a confusion matrix, which shows the most common types of errors. Consider your best tagger from any part of the assignment, and identify its three most frequent errors (e.g., N classified as V). For each type of error, find an example sentence in which your tagger made it, explain why you think it made the mistake, and describe how it could be fixed. (If you are feeling competitive, you can then use this information to go back and try to improve your features.)
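A small helper along these lines may be handy for picking out the most frequent errors. It assumes the confusion matrix is a dict keyed by pairs of tags, as in the evalTagger outputs above. Which element of the pair is the gold tag and which is the predicted tag is not shown here, so check that in scorer.py. The function name and the toy counts (copied from part of the truncated output above) are for illustration only.

```python
from collections import Counter

def top_confusions(confusion, n=3):
    """Return the n most frequent off-diagonal entries of a confusion dict."""
    errors = Counter({pair: count for pair, count in confusion.items()
                      if pair[0] != pair[1]})  # drop correct predictions
    return errors.most_common(n)

# a handful of entries copied from the (truncated) output above
toy_confusion = {('N', 'N'): 522, (',', ','): 473, ('V', 'N'): 44,
                 ('N', '^'): 34, ('A', 'N'): 28, ('N', 'A'): 20}
top3 = top_confusions(toy_confusion)
print(top3)  # [(('V', 'N'), 44), (('N', '^'), 34), (('A', 'N'), 28)]
```

Because the toy dict holds only part of the real matrix, the ranking it prints is not necessarily the true top three for that run.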

# 4. Bakeoff! #

48 hours before the assignment is due, we will send you unlabeled test data. Your job is to produce a response file that I can evaluate. I'll present the results in class and give the best scorers a chance to explain what they did.


**Deliverable 4** (3 points) Run your best system from any part of the
assignment on the test data using the `generateKaggleSubmission()` function. Submit
your response file to the class [Kaggle bakeoff](https://inclass.kaggle.com/c/gt-book-review-sentiment-analysis). Also **submit your Kaggle response file to T-Square as 'lastname-firstname.response'.** The top
scores will be announced in class.


In [None]:
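# The bakeoff response should come from the test data: change testfile=devfile to testfile=testfile once the test file has been released.
confusion = gtnlplib.tagger_base.evalTagger(lambda words, alltags : gtnlplib.viterbi.viterbiTagger(words,gtnlplib.features.yourHMMFeatures,theta,alltags)[0],'lastname-firstname.response',testfile=devfile)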