# Sequence Labeling, Tagging with Classifiers

- 📺 **Video:** [https://youtu.be/yQZ0mDW-U3g](https://youtu.be/yQZ0mDW-U3g)

## Overview
- Treat sequence labeling as token-level classification with local features.
- Explore structured enhancements like CRFs for capturing label dependencies.

## Key ideas
- **Local classifiers:** logistic regression or perceptrons predict tags independently.
- **Feature templates:** encode context, capitalization, affixes, and lexicons.
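- **Error propagation:** greedy decoding commits to one tag at a time, so an early mistake feeds into later predictions and the output can violate global constraints such as valid tag sequences.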
- **Structured prediction:** adding CRFs or beam search addresses label dependencies.
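
The contrast between greedy decoding and structured prediction can be sketched with a toy scoring model. The block below is a minimal illustration, not the lecture's implementation: the emission and transition scores are made-up integers, and the function names (`greedy_decode`, `viterbi_decode`, `path_score`) are hypothetical. It assumes a first-order model in which a sequence's score is the sum of per-token emission scores and tag-to-tag transition scores, which is the scoring form a linear-chain CRF uses.

```python
def greedy_decode(emissions, transitions, tags):
    """Pick the best tag at each position given only the previously chosen tag."""
    path, prev = [], None
    for scores in emissions:
        best = max(
            tags,
            key=lambda t: scores[t] + (transitions[(prev, t)] if prev is not None else 0),
        )
        path.append(best)
        prev = best
    return path


def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence under emission + transition scores."""
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (emissions[0][t], [t]) for t in tags}
    for scores in emissions[1:]:
        new_best = {}
        for t in tags:
            prev_tag = max(tags, key=lambda p: best[p][0] + transitions[(p, t)])
            prev_score, prev_path = best[prev_tag]
            new_best[t] = (
                prev_score + transitions[(prev_tag, t)] + scores[t],
                prev_path + [t],
            )
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def path_score(path, emissions, transitions):
    """Total score of a tag sequence: emissions plus transitions between neighbors."""
    score = emissions[0][path[0]]
    for i in range(1, len(path)):
        score += transitions[(path[i - 1], path[i])] + emissions[i][path[i]]
    return score


# Made-up scores for a two-token input (assumption, for illustration only).
tags = ["NOUN", "VERB"]
emissions = [
    {"NOUN": 2, "VERB": 3},  # token 0 slightly prefers VERB in isolation
    {"NOUN": 1, "VERB": 2},
]
transitions = {
    ("NOUN", "NOUN"): 0, ("NOUN", "VERB"): 4,
    ("VERB", "NOUN"): 0, ("VERB", "VERB"): -4,
}

greedy_path = greedy_decode(emissions, transitions, tags)
viterbi_path = viterbi_decode(emissions, transitions, tags)
print(greedy_path, path_score(greedy_path, emissions, transitions))    # ['VERB', 'NOUN'] 4
print(viterbi_path, path_score(viterbi_path, emissions, transitions))  # ['NOUN', 'VERB'] 8
```

Greedy decoding locks in `VERB` for the first token because it scores best locally, and cannot recover; the dynamic program keeps the best path ending in each tag and finds the higher-scoring sequence.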

## Demo
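Train a perceptron-based tagger with contextual features, including the previous tag that a greedy left-to-right decoder would supply, echoing the lecture (https://youtu.be/Xvx1PxSj8FE). The cell trains with gold previous tags and scores the model on its own training sentences, so a perfect report shows only that the toy data was memorized.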

from sklearn.linear_model import Perceptron
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report

train_data = [
    (['John', 'visits', 'Paris'], ['PROPN', 'VERB', 'PROPN']),
    (['Mary', 'loves', 'coffee'], ['PROPN', 'VERB', 'NOUN']),
    (['Paris', 'offers', 'cafes'], ['PROPN', 'VERB', 'NOUN'])
]

def features(sent, i, prev_tag):
    """Feature dict for token i: word identity, shape, affixes, neighbors, previous tag."""
    word = sent[i]
    prev_word = sent[i-1] if i > 0 else '<START>'
    next_word = sent[i+1] if i < len(sent) - 1 else '<END>'
    return {
        'word.lower': word.lower(),
        'is_capitalized': word[0].isupper(),
        'suffix2': word[-2:],
        'prefix1': word[0],
        'prev_word': prev_word,
        'next_word': next_word,
        'prev_tag': prev_tag
    }

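# Build one training example per token; the previous-tag feature uses the gold tag.
X, y = [], []
for words, tags in train_data:
    prev = '<START>'
    for i in range(len(words)):
        X.append(features(words, i, prev))
        y.append(tags[i])
        prev = tags[i]

# One-hot encode string features; boolean features become 0/1 columns.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(X)
clf = Perceptron(max_iter=50, random_state=0)
clf.fit(X, y)

# Sanity check on the training data itself (gold previous tags), not a held-out evaluation.
pred = clf.predict(X)
print(classification_report(y, pred, digits=3))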


              precision    recall  f1-score   support

        NOUN      1.000     1.000     1.000         2
       PROPN      1.000     1.000     1.000         4
        VERB      1.000     1.000     1.000         3

    accuracy                          1.000         9
   macro avg      1.000     1.000     1.000         9
weighted avg      1.000     1.000     1.000         9
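
The report above is computed with gold previous tags. At test time those are unavailable, so a greedy decoder has to feed each predicted tag into the features of the next token. The following is a minimal, self-contained sketch of that loop, assuming the same toy data and feature template as the demo; the `greedy_tag` helper and the test sentence are hypothetical additions, and no particular output is claimed for a model trained on three sentences.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron

train_data = [
    (['John', 'visits', 'Paris'], ['PROPN', 'VERB', 'PROPN']),
    (['Mary', 'loves', 'coffee'], ['PROPN', 'VERB', 'NOUN']),
    (['Paris', 'offers', 'cafes'], ['PROPN', 'VERB', 'NOUN']),
]


def features(sent, i, prev_tag):
    """Same feature template as the demo."""
    word = sent[i]
    return {
        'word.lower': word.lower(),
        'is_capitalized': word[0].isupper(),
        'suffix2': word[-2:],
        'prefix1': word[0],
        'prev_word': sent[i - 1] if i > 0 else '<START>',
        'next_word': sent[i + 1] if i < len(sent) - 1 else '<END>',
        'prev_tag': prev_tag,
    }


# Training still uses gold previous tags.
train_feats, train_tags = [], []
for words, tags in train_data:
    prev = '<START>'
    for i, tag in enumerate(tags):
        train_feats.append(features(words, i, prev))
        train_tags.append(tag)
        prev = tag

vec = DictVectorizer(sparse=False)
clf = Perceptron(max_iter=50, random_state=0)
clf.fit(vec.fit_transform(train_feats), train_tags)


def greedy_tag(words):
    """Tag left to right, feeding each predicted tag into the next token's features."""
    prev, out = '<START>', []
    for i in range(len(words)):
        # Feature values unseen in training are silently ignored by DictVectorizer.
        x = vec.transform([features(words, i, prev)])
        prev = str(clf.predict(x)[0])
        out.append(prev)
    return out


# Hypothetical unseen sentence built from training vocabulary.
test_words = ['Mary', 'visits', 'Paris']
predicted = greedy_tag(test_words)
print(list(zip(test_words, predicted)))
```

Because each prediction becomes the `prev_tag` feature of the next token, a wrong tag early in the sentence changes the inputs for every later decision, which is the error propagation described in the key ideas.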



## Try it
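- Modify the demo: drop the `prev_tag` feature or add a new feature template, then compare the classification report.
- Add a tiny held-out dataset or a counter-example, such as a sentence whose capitalized first word is not a proper noun.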


## References
- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)
- [To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks](https://www.aclweb.org/anthology/W19-4302/)
- [GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding](https://arxiv.org/pdf/1804.07461.pdf)
- [What Does BERT Look At? An Analysis of BERT's Attention](https://arxiv.org/abs/1906.04341)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf)
- [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)
- [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
- [UnifiedQA: Crossing Format Boundaries With a Single QA System](https://arxiv.org/abs/2005.00700)
- [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/pdf/1508.07909.pdf)
- [Byte Pair Encoding is Suboptimal for Language Model Pretraining](https://arxiv.org/pdf/2004.03720.pdf)
- [Eisenstein 8.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [TnT - A Statistical Part-of-Speech Tagger](https://arxiv.org/abs/cs/0003055)
- [Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger](https://www.aclweb.org/anthology/W00-1308/)
- [Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?](https://link.springer.com/chapter/10.1007/978-3-642-19400-9_14)
- [Natural Language Processing with Small Feed-Forward Networks](https://www.aclweb.org/anthology/D17-1309.pdf)
- [Eisenstein 10.1-10.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 10.3-10.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 10.3.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Accurate Unlexicalized Parsing](https://www.aclweb.org/anthology/P03-1054/)
- [Eisenstein 10.5](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 11.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Finding Optimal 1-Endpoint-Crossing Trees](https://www.aclweb.org/anthology/Q13-1002/)
- [Eisenstein 11.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)


*Links only; we do not redistribute slides or papers.*