Small project on training Trigram Models and using them to perform classification tasks

jkafrouni/trigram_model

COMS 4705 - NLP - Prof. Daniel Bauer

Homework 1 - Building a Trigram Language Model

This is a small homework assignment that I completed for an NLP course at Columbia University. The script trigram_model.py contains a TrigramModel class that builds a trigram model over a training corpus and can compute the perplexity of a test corpus, a metric that evaluates how well the model predicts each word (lower perplexity is better).
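As a rough illustration of the perplexity metric, here is a minimal sketch. The function name `perplexity` and the `sentence_logprob` callback are hypothetical and are not taken from trigram_model.py. It also assumes base-2 logarithms and counts only the words of each sentence as tokens; the actual script may use a different convention (for example, also counting an end-of-sentence token).

```python
import math
from typing import Callable, Iterable, List


def perplexity(
    corpus: Iterable[List[str]],
    sentence_logprob: Callable[[List[str]], float],
) -> float:
    """Return 2 ** (-(1/M) * sum of log2 sentence probabilities).

    `sentence_logprob` is a hypothetical callback that returns the base-2
    log probability a model assigns to one tokenized sentence.
    M is the total number of tokens in the corpus (assumed convention).
    """
    total_logprob = 0.0
    total_tokens = 0
    for sentence in corpus:
        total_logprob += sentence_logprob(sentence)
        total_tokens += len(sentence)
    if total_tokens == 0:
        raise ValueError("corpus contains no tokens")
    return 2 ** (-total_logprob / total_tokens)


# Toy model (assumption): every token has probability 1/8.
def uniform_logprob(sentence: List[str]) -> float:
    return len(sentence) * math.log2(1 / 8)


toy_corpus = [["a", "b", "c"], ["d", "e"]]
toy_perplexity = perplexity(toy_corpus, uniform_logprob)
```

With a uniform model over 8 equally likely tokens, the perplexity equals the number of choices, which is why perplexity is often read as an effective branching factor.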

Model

During training, the model replaces every word that appears only once in the training corpus with an unknown token ('UNK'). This makes it possible to approximate the probability of truly unseen tokens in the test corpus, which are also mapped to 'UNK'.
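The unknown-token step can be sketched as follows. The helper names `build_lexicon` and `replace_rare` are hypothetical and may not match the functions in trigram_model.py; the sketch only assumes that a corpus is a list of tokenized sentences.

```python
from collections import Counter
from typing import Iterable, List, Set

UNK = "UNK"


def build_lexicon(corpus: Iterable[List[str]]) -> Set[str]:
    """Keep only the words seen more than once in the training corpus."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return {token for token, count in counts.items() if count > 1}


def replace_rare(sentence: List[str], lexicon: Set[str]) -> List[str]:
    """Map every token outside the lexicon to the unknown token."""
    return [token if token in lexicon else UNK for token in sentence]


train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
lexicon = build_lexicon(train)

# The same mapping is applied to training and test sentences.
train_unk = [replace_rare(sentence, lexicon) for sentence in train]
test_unk = replace_rare(["the", "bird", "sat"], lexicon)
```

Because singleton training words become 'UNK', the model has seen 'UNK' in real contexts and can assign a non-zero probability to it at test time.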

The model uses linear interpolation to cope with unseen trigrams: the probability of a trigram is a weighted sum of the trigram, bigram, and unigram estimates for its final word. The probabilities are therefore smoothed, and no trigram gets a probability of 0.
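A minimal sketch of linear interpolation is shown below. The equal weights of 1/3, the 'START'/'STOP' padding tokens, and the function names are all assumptions made for illustration; the repository may choose its weights and padding differently.

```python
from collections import Counter
from typing import Iterable, List, Tuple

START, STOP = "START", "STOP"  # assumed padding tokens


def train_counts(corpus: Iterable[List[str]]):
    """Count n-grams and their contexts for every predicted token."""
    uni, bi, tri = Counter(), Counter(), Counter()
    bi_ctx, tri_ctx = Counter(), Counter()
    for sentence in corpus:
        padded = [START, START] + list(sentence) + [STOP]
        for i in range(2, len(padded)):
            u, v, w = padded[i - 2], padded[i - 1], padded[i]
            uni[w] += 1
            bi[(v, w)] += 1
            tri[(u, v, w)] += 1
            bi_ctx[v] += 1          # how often v is a bigram context
            tri_ctx[(u, v)] += 1    # how often (u, v) is a trigram context
    return uni, bi, tri, bi_ctx, tri_ctx


def interpolated_prob(
    u: str,
    v: str,
    w: str,
    counts,
    lambdas: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """P(w | u, v) as a weighted sum of trigram, bigram, unigram estimates."""
    uni, bi, tri, bi_ctx, tri_ctx = counts
    p_uni = uni[w] / sum(uni.values())
    # Unseen contexts contribute 0 instead of dividing by zero.
    p_bi = bi[(v, w)] / bi_ctx[v] if bi_ctx[v] else 0.0
    p_tri = tri[(u, v, w)] / tri_ctx[(u, v)] if tri_ctx[(u, v)] else 0.0
    l_tri, l_bi, l_uni = lambdas
    return l_tri * p_tri + l_bi * p_bi + l_uni * p_uni


counts = train_counts([["a", "b"], ["a", "c"]])
seen = interpolated_prob(START, "a", "b", counts)   # trigram seen in training
unseen = interpolated_prob("b", "a", "c", counts)   # trigram never seen
```

In this toy corpus the trigram ("b", "a", "c") never occurs, yet `unseen` is still positive because the bigram ("a", "c") and the unigram "c" were observed. This only holds as long as the final word itself was seen in training, which is the role of the 'UNK' replacement.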

Predictions

We use the model to perform binary classification over a dataset of essays, each labeled "high score" or "low score". We first train one trigram model on a training set of low-scoring essays and a second trigram model on a training set of high-scoring essays.

To predict the score of an unknown essay, we compute its perplexity with respect to each model. Since perplexity describes how well an n-gram model predicts the words of a corpus, and lower perplexity is better, we assign the essay to the class whose model returns the lowest perplexity.
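The decision rule can be sketched like this. The names `classify` and `accuracy` and the two perplexity callbacks are hypothetical stand-ins for the two trained models, and breaking ties in favor of "low" is an arbitrary assumption.

```python
from typing import Callable, List, Sequence

Essay = List[List[str]]  # an essay as a list of tokenized sentences
PerplexityFn = Callable[[Essay], float]


def classify(essay: Essay, pp_high: PerplexityFn, pp_low: PerplexityFn) -> str:
    """Assign the essay to the class whose model gives the lower perplexity."""
    return "high" if pp_high(essay) < pp_low(essay) else "low"


def accuracy(
    essays: Sequence[Essay],
    labels: Sequence[str],
    pp_high: PerplexityFn,
    pp_low: PerplexityFn,
) -> float:
    """Fraction of essays whose predicted class matches the true label."""
    correct = sum(
        classify(essay, pp_high, pp_low) == label
        for essay, label in zip(essays, labels)
    )
    return correct / len(labels)


# Stand-in perplexity functions (assumption): a real run would call the
# perplexity method of each trained trigram model instead.
def fake_pp_high(essay: Essay) -> float:
    return 10.0 if ["good"] in essay else 50.0


def fake_pp_low(essay: Essay) -> float:
    return 30.0


essays = [[["good"]], [["bad"]]]
predictions = [classify(e, fake_pp_high, fake_pp_low) for e in essays]
score = accuracy(essays, ["high", "low"], fake_pp_high, fake_pp_low)
```

Passing the two models in as callbacks keeps the rule independent of how each model computes perplexity.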

Sentence generation

The TrigramModel class also contains a method "generate_sentence" that randomly generates a sentence (up to a given maximum length) using the trigram probabilities learned from the training corpus.
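One possible way to generate a sentence is sketched below as a standalone function. It is not the repository's method: it samples each next word in proportion to raw trigram counts, and the 'START'/'STOP' padding tokens, the `trigram_successors` helper, and the `rng` parameter are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

START, STOP = "START", "STOP"  # assumed padding tokens


def trigram_successors(
    corpus: Iterable[List[str]],
) -> Dict[Tuple[str, str], Counter]:
    """Map each two-word context to a counter of the words that follow it."""
    successors: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for sentence in corpus:
        padded = [START, START] + list(sentence) + [STOP]
        for u, v, w in zip(padded, padded[1:], padded[2:]):
            successors[(u, v)][w] += 1
    return successors


def generate_sentence(
    successors: Dict[Tuple[str, str], Counter],
    max_len: int = 20,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Sample words one at a time until STOP or max_len words are produced."""
    rng = rng or random.Random()
    context = (START, START)
    sentence: List[str] = []
    while len(sentence) < max_len:
        options = successors.get(context)
        if not options:  # context never seen in training
            break
        words = list(options)
        weights = [options[word] for word in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == STOP:
            break
        sentence.append(word)
        context = (context[1], word)
    return sentence


successors = trigram_successors([["the", "cat", "sat"]])
full = generate_sentence(successors, max_len=10, rng=random.Random(0))
truncated = generate_sentence(successors, max_len=2, rng=random.Random(0))
```

With a single training sentence every context has exactly one successor, so the output is deterministic; on a larger corpus, passing a seeded `random.Random` makes runs reproducible.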

Data

The dataset used for this project is proprietary, so it has not been added to this repo. However, a small dataset used to debug the models can be found in the data folder.
