Skip to content

jlam55555/naive-bayes-text-categorization

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 
 
 

naive-bayes-text-categorization

Text categorization using Naive Bayes for ECE467 project 1

Model description

The text categorization is performed using a Naive Bayesian bag-of-words model. Tokenization is performed using NLTK's word_tokenizer utility.

Generating train/test splits

The files in corpora/corpus[1-3]/*, analyze.pl, and corpus[1-3]_train.labels were provided from the course webpage. corpora/corpus1_test.list and corpora/corpus1_test.labels were also provided to demonstrate the format of the test files and test labels.

The *.labels files include both filenames and class labels (for the training dataset and for evaluating the test dataset), and the *.list files include only document filenames (for evaluation).

Random train/test splits for corpora 2 and 3 (corpora/corpus[2-3]_sub.labels, corpora/corpus[2-3]_test.list, and corpora/corpus[2-3]_test.labels), to mimick the train/test structure of corpus 1, were created using train_test_split.py:

$ python3 train_test_split.py
Enter input training file: corpora/corpus3_train.labels
Training subset filename [corpora/corpus3_sub.labels]: 
Test list filename [corpora/corpus3_test.list]: 
Test labels filename [corpora/corpus3_test.labels]: 

Training and prediction

These train/test splits can be used to train a Naive Bayes classifier and predict on the test set:

$ python3 main.py
Enter training file: corpora/corpus3_sub.labels
Enter test file: corpora/corpus3_test.list
Enter out file: predictions/corpus3_pred.labels

Model evaluation

A confusion matrix is generated using analyze.pl on the predicted and true class labels:

$ perl analyze.pl predictions/corpus3_pred.labels corpora/corpus3_test.labels
Processing answer file...
Found 6 categories: Sci Ent Fin Spo Wor USN
Processing prediction file...

150 CORRECT, 25 INCORRECT, RATIO = 0.857142857142857.

CONTINGENCY TABLE:
        Sci     Ent     Fin     Spo     Wor     USN     PREC
Sci     20      0       0       0       0       0       1.00
Ent     0       1       0       0       0       0       1.00
Fin     1       2       19      0       0       0       0.86
Spo     0       0       0       14      0       0       1.00
Wor     0       1       0       1       53      3       0.91
USN     5       4       2       3       3       43      0.72
RECALL  0.77    0.12    0.90    0.78    0.95    0.93    

F_1(Sci) = 0.869565217391304
F_1(Ent) = 0.222222222222222
F_1(Fin) = 0.883720930232558
F_1(Spo) = 0.875
F_1(Wor) = 0.929824561403509
F_1(USN) = 0.811320754716981

About

Naive Bayes text categorization tool for ECE467

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published