A Naive Bayes classifier for topic classification

ollie283/navie-bayes

Text classification

A program written in Python that performs Naive Bayes classification.

Data: The folder contains three files: sampleTrain.txt, sampleTrain.vocab.txt and sampleTest.txt. sampleTrain.txt is the training data for building the classifier, and sampleTrain.vocab.txt lists the vocabulary of the training data. The classifier is run and evaluated on the test data in sampleTest.txt. In sampleTrain.txt and sampleTest.txt, the first column is the document id, the second column gives the gold-standard class of the document, and the third column gives the words in the document. Columns are separated by tabs.
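The three-column layout described above could be read with something like the following. This is a minimal sketch, not the repository's own code: the function names `parse_documents` and `read_documents` are hypothetical, and it assumes the words in the third column are separated by whitespace and that class labels are the integers 0 and 1.

```python
from pathlib import Path


def parse_documents(lines):
    """Parse lines of the form: doc_id <TAB> class <TAB> words."""
    docs = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        doc_id, label = fields[0], int(fields[1])
        # Assumption: words in the third column are whitespace-separated.
        words = fields[2].split() if len(fields) > 2 else []
        docs.append((doc_id, label, words))
    return docs


def read_documents(filename):
    """Read a data file using a relative path (same directory as the script)."""
    text = Path(filename).read_text(encoding="utf-8")
    return parse_documents(text.splitlines())
```

Calling `read_documents("sampleTrain.txt")` would then give a list of `(doc_id, label, words)` tuples. The vocabulary file is left out here because its layout is not described above.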

There are 2 classes in the data: 0 and 1.

Task: Build a Naive Bayes classifier that uses the document words as features. It should compute a model from training data and predict classes for a new test set. For this project, train the model on sampleTrain.txt and use it to predict classes for the documents in sampleTest.txt. Use Laplace smoothing for the feature likelihoods. The dataset has been simplified so that the test corpus contains only words seen during training, so no UNK token is needed. The prior probabilities do not need smoothing either.
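One way to approach the task is sketched below. It assumes a multinomial model (each word occurrence counts), which the task description does not state explicitly, and add-one smoothing of the form `P(w | c) = (count(w, c) + 1) / (total words in c + |V|)`. The names `train` and `predict` are hypothetical, `docs` is a list of `(doc_id, label, words)` tuples, and `vocab` is a list of vocabulary words that covers every word in the training data.

```python
import math
from collections import Counter


def train(docs, vocab):
    """Return (priors, likelihoods) estimated from the training documents."""
    classes = sorted({label for _, label, _ in docs})
    doc_counts = Counter(label for _, label, _ in docs)
    word_counts = {c: Counter() for c in classes}
    for _, label, words in docs:
        word_counts[label].update(words)

    # Priors are left unsmoothed: fraction of training documents per class.
    priors = {c: doc_counts[c] / len(docs) for c in classes}

    # Laplace (add-one) smoothing over the vocabulary.
    likelihoods = {}
    for c in classes:
        total = sum(word_counts[c][w] for w in vocab)
        likelihoods[c] = {
            w: (word_counts[c][w] + 1) / (total + len(vocab)) for w in vocab
        }
    return priors, likelihoods


def predict(words, priors, likelihoods):
    """Pick the class with the highest log posterior score."""
    scores = {
        c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in words)
        for c in priors
    }
    return max(scores, key=scores.get)
```

Summing log probabilities avoids numeric underflow on long documents. `predict` raises `KeyError` for a word outside the vocabulary, which is acceptable here only because the test corpus contains no unseen words.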

Code: The program should run without any arguments and read its input files from the same directory; absolute paths must not be used. It should print values in the following format:

    Prior probabilities
    class 0 = 
    class 1 =

    Feature likelihoods 
            great    sad    boring ... 
    class 0 
    class 1

    Predictions on test data 
    d5 = 
    d6 = 
    d7 = 
    d8 = 
    d9 = 
    d10 =

    Accuracy on test data =

The features in the feature likelihood table (great, sad, boring, ...) can be printed in any order.
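The report layout above could be produced along these lines. This is only a sketch: `accuracy` and `format_report` are hypothetical names, the four-decimal number format and tab-separated table columns are assumptions (the required precision is not specified), and `predictions` is taken to be a list of `(doc_id, label)` pairs.

```python
def accuracy(gold, predicted):
    """Fraction of documents whose predicted class equals the gold class."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def format_report(priors, likelihoods, vocab, predictions, acc):
    """Build the report text in the layout shown above."""
    lines = ["Prior probabilities"]
    lines += [f"class {c} = {priors[c]:.4f}" for c in sorted(priors)]
    # Header row of feature names, then one row of likelihoods per class.
    lines += ["", "Feature likelihoods", "\t" + "\t".join(vocab)]
    for c in sorted(likelihoods):
        row = "\t".join(f"{likelihoods[c][w]:.4f}" for w in vocab)
        lines.append(f"class {c}\t{row}")
    lines += ["", "Predictions on test data"]
    lines += [f"{doc_id} = {label}" for doc_id, label in predictions]
    lines += ["", f"Accuracy on test data = {acc:.4f}"]
    return "\n".join(lines)
```

Printing the string returned by `format_report` gives the full report in one call, and keeping the formatting separate from the model code makes the layout easy to adjust.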
