A class project for CS585: Introduction to Natural Language Processing. A Structured Perceptron Linear Classifier and Viterbi Decoder for Part of Speech Tagging.

Structured Perceptron and Viterbi Based Part of Speech Tagger

A project for CS585 - Introduction to Natural Language Processing

Assignment Description (Viterbi, Perceptron)

Starter code

Data (training, development)

Instructor: Brendan T. O'Connor

Description

Trains a structured perceptron linear classifier to tag parts of speech, using the Viterbi algorithm for decoding. The assignment code has been cleaned up and streamlined to make it easier to read and use. As a result, this repository does not contain the complete solution to the assignment, only the parts I considered most relevant to share.

Instructor Implementations

baseline.py

All

vit.py

  • dict_argmax
  • goodness_score
  • exhaustive
  • randomized_test

structperc.py

  • dict_subtract
  • dict_argmax
  • dict_dotprod
  • read_tagging_file
  • do_evaluation
  • fancy_eval
  • show_predictions
  • greedy_decode

Modifications to Instructor Implementations

structperc.py

  • local_emission_features: Added suffix features
  • train: Implemented the inner loop, which is the core of the training algorithm. The instructor code was just a skeleton.
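
To illustrate the two modifications above, here is a minimal sketch of suffix-based emission features and of a single perceptron weight update. It is not the code in structperc.py: the function names `suffix_features` and `perceptron_update`, the feature-name format, and the `max_len` and `stepsize` parameters are all assumptions made for this example.

```python
from collections import defaultdict


def suffix_features(word, tag, max_len=3):
    """Hypothetical emission features: pair the tag with the word's last 1..max_len characters."""
    feats = defaultdict(float)
    lower = word.lower()
    for n in range(1, max_len + 1):
        if len(lower) > n:  # skip suffixes that would cover the whole word
            feats["suffix=%s_tag=%s" % (lower[-n:], tag)] += 1.0
    return feats


def perceptron_update(weights, gold_feats, pred_feats, stepsize=1.0):
    """Move the weights toward the gold features and away from the predicted ones."""
    for name in set(gold_feats) | set(pred_feats):
        delta = gold_feats.get(name, 0.0) - pred_feats.get(name, 0.0)
        if delta != 0.0:
            weights[name] += stepsize * delta
    return weights


# Example: the model predicted V for "dogs" but the gold tag is N
weights = defaultdict(float)
perceptron_update(weights, suffix_features("dogs", "N"), suffix_features("dogs", "V"))
```

In a structured perceptron, the update is applied to feature counts summed over the whole sentence, and only when the decoded sequence differs from the gold sequence. Features shared by both sequences cancel out, so only the disagreeing positions change the weights.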

Implementations I provided

vit.py

  • viterbi

structperc.py

  • get_averaged_weights
  • predict_seq
  • features_for_seq
  • calc_factor_scores
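
For readers unfamiliar with Viterbi decoding, the following is a minimal sketch of the idea, assuming an additive score made of transition scores between adjacent tags and per-position emission scores. The argument names (`trans`, `emissions`, `tags`) and the data layout are assumptions for this example and may differ from the `viterbi` function in vit.py.

```python
def viterbi(trans, emissions, tags):
    """Return the highest-scoring tag sequence under an additive score.

    trans: dict mapping (prev_tag, cur_tag) -> transition score (missing pairs score 0)
    emissions: list with one dict per position, mapping tag -> emission score
    tags: list of all possible tags
    """
    n = len(emissions)
    if n == 0:
        return []
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0].get(tag, 0.0) for tag in tags}]
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for cur in tags:
            # Pick the previous tag that maximizes the score of reaching `cur`
            prev = max(tags, key=lambda p: best[t - 1][p] + trans.get((p, cur), 0.0))
            best[t][cur] = (best[t - 1][prev] + trans.get((prev, cur), 0.0)
                            + emissions[t].get(cur, 0.0))
            back[t][cur] = prev
    # Follow the backpointers from the best final tag
    last = max(tags, key=lambda tag: best[n - 1][tag])
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy example with two tags and three positions
toy_trans = {("N", "V"): 2.0, ("V", "N"): 1.0, ("N", "N"): -1.0, ("V", "V"): -1.0}
toy_emissions = [{"N": 1.0, "V": 0.0}, {"N": 0.0, "V": 0.5}, {"N": 1.0, "V": 0.0}]
toy_path = viterbi(toy_trans, toy_emissions, ["N", "V"])
```

The dynamic program considers every previous tag for every current tag at each position, so it runs in time proportional to the sentence length times the square of the number of tags, rather than enumerating every possible tag sequence.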

Demo

To train a tagger with 10 iterations of the structured perceptron, using Viterbi decoding:

python structperc.py

baseline.py measures the accuracy obtained by assigning every word the same tag. To run this baseline:

python baseline.py
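
As a rough illustration of what a single-tag baseline computes, here is a minimal sketch. It assumes the single tag is the most frequent tag in the training data and that each dataset is a list of (tokens, tags) pairs; both are assumptions for this example, and `single_tag_baseline` is a hypothetical name, not a function from baseline.py.

```python
from collections import Counter


def single_tag_baseline(train_data, dev_data):
    """Accuracy on dev_data when every token gets the most frequent training tag."""
    tag_counts = Counter(tag for _, tags in train_data for tag in tags)
    guess = tag_counts.most_common(1)[0][0]
    gold = [tag for _, tags in dev_data for tag in tags]
    correct = sum(1 for tag in gold if tag == guess)
    return guess, float(correct) / len(gold)


# Toy data: (tokens, tags) pairs
toy_train = [(["the", "dog", "runs"], ["D", "N", "V"]),
             (["a", "cat"], ["D", "N"]),
             (["dogs"], ["N"])]
toy_dev = [(["the", "cat", "sleeps", "dogs"], ["D", "N", "V", "N"])]
toy_guess, toy_accuracy = single_tag_baseline(toy_train, toy_dev)
```

A baseline like this gives a floor for judging the trained tagger: any useful model should score well above it on the development data.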

Usage

# Import the trainer and the reader for tagging files
from structperc import train, read_tagging_file

# Read tagging files in the format of oct27.train and oct27.dev
trainset = read_tagging_file('oct27.train')
devset = read_tagging_file('oct27.dev')

# Train with averaging on the oct27.train data, evaluating on the oct27.dev data
train(trainset, do_averaging=True, devdata=devset)