This repository has been archived by the owner on Oct 21, 2023. It is now read-only.
Document text classifier training using luigi

pmatigakis/dctool2

dctool2 is a collection of luigi tasks that train a document text classifier.

Installation

Create and activate a virtualenv environment

virtualenv --python=python3 virtualenv
source virtualenv/bin/activate

Download dctool2 and install it from the root of the repository

python setup.py install

Usage

dctool2 requires labeled documents stored in a file in an HDFS folder. Each line of that file must contain exactly one JSON-encoded object that describes one document. Each object must have the following schema.

{
    "text": "the document content",
    "category": "the document category"
}
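
As a minimal sketch of how such a file could be produced, the Python snippet below writes one JSON object per line using the two keys from the schema above. The helper names write_documents and read_documents and the sample data are illustrative assumptions and are not part of dctool2.

```python
import json


def write_documents(documents, path):
    """Write (text, category) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as output:
        for text, category in documents:
            record = {"text": text, "category": category}
            # json.dumps escapes embedded newlines, so each record stays on one line
            output.write(json.dumps(record) + "\n")


def read_documents(path):
    """Read the file back, yielding one dict per non-empty line."""
    with open(path, encoding="utf-8") as source:
        for line in source:
            if line.strip():
                yield json.loads(line)
```

The resulting local file would still have to be copied to HDFS, for example with hdfs dfs -put, before the tasks can read it.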

Start the luigi scheduler

luigid --pidfile /path/to/pid/file --logdir /path/to/logs --state-path /path/to/state/file

Run the luigi tasks. The CreateClassifier task will perform a grid search to find the parameters that give the best classification result.

The following parameters must be given in the luigi.cfg file

variable          description
documents-file    the HDFS path to the training documents
output-folder     the path to store the results
categories        what categories to use in the classifier
test-size         the test set size
min-df-list       the term minimum document frequency
max-df-list       the term maximum document frequency
percentile-list   the percentile of features to keep
namenode-host     the Hadoop namenode address
namenode-port     the Hadoop namenode port
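
The parameters above could be collected into a luigi.cfg file along the following lines. This is only a sketch: the section name CreateClassifier, every value, and the JSON-style list syntax are assumptions made for illustration, so check the task definitions in dctool2.categories.tasks for the section and value formats the tasks actually expect.

```python
import configparser


def build_config(section="CreateClassifier"):
    """Build a config object holding the nine parameters listed above.

    The section name and all values are hypothetical placeholders.
    """
    config = configparser.ConfigParser()
    config[section] = {
        "documents-file": "/data/documents.jsonl",  # hypothetical HDFS path
        "output-folder": "/data/dctool2-output",    # hypothetical path
        "categories": '["sports", "politics"]',     # assumed list syntax
        "test-size": "0.2",
        "min-df-list": "[1, 2, 5]",
        "max-df-list": "[0.5, 0.8, 1.0]",
        "percentile-list": "[10, 50, 100]",
        "namenode-host": "localhost",
        "namenode-port": "8020",                    # hypothetical port
    }
    return config


def write_config(path, section="CreateClassifier"):
    """Write the sketch configuration to the given path."""
    with open(path, "w", encoding="utf-8") as output:
        build_config(section).write(output)
```

Calling write_config("luigi.cfg") in the directory where luigi is started would create the file.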

Start the task with the following command

luigi --module dctool2.categories.tasks CreateClassifier --workers 4 

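The trained classifier will be saved to the <output-folder>/trained_classifier/classifier.pickle file. Use scikit-learn's sklearn.externals.joblib module to load it. Recent scikit-learn releases no longer ship sklearn.externals.joblib; with those, the standalone joblib package provides the same load function.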
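
Loading the classifier might look like the sketch below. It assumes the pickle file is reachable on the local filesystem (copy it from HDFS first if the output folder lives there). The helper names classifier_path and load_classifier are illustrative and are not part of dctool2.

```python
import os


def classifier_path(output_folder):
    """Return the location of the pickled classifier inside the output folder."""
    return os.path.join(output_folder, "trained_classifier", "classifier.pickle")


def load_classifier(output_folder):
    """Load the trained classifier with joblib.

    The import is done lazily so that this module can be imported even
    when neither joblib nor scikit-learn is installed.
    """
    try:
        import joblib  # standalone package
    except ImportError:
        from sklearn.externals import joblib  # bundled with older scikit-learn
    return joblib.load(classifier_path(output_folder))
```

Pickled scikit-learn objects are generally only safe to load with the same scikit-learn version that produced them, so the loading environment should match the training one.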

The classifier evaluation will be stored in the <output-folder>/analysis folder.

Keep in mind that training can take a long time. On a laptop with an i3-3217U CPU and 8GB of RAM it took about an hour to train a classifier using a 2000 document dataset with several different parameters.
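
One reason training is slow is that a grid search fits a classifier for every parameter combination. The sketch below assumes the grid is the Cartesian product of the min-df-list, max-df-list and percentile-list values, which is how a grid search is usually defined but is not confirmed for dctool2. The function name and the sample values are illustrative.

```python
from itertools import product


def parameter_grid(min_df_list, max_df_list, percentile_list):
    """Enumerate every combination of the three parameter lists."""
    return [
        {"min_df": min_df, "max_df": max_df, "percentile": percentile}
        for min_df, max_df, percentile in product(
            min_df_list, max_df_list, percentile_list
        )
    ]


# Three values per list already give 3 * 3 * 3 = 27 combinations
grid = parameter_grid([1, 2, 5], [0.5, 0.8, 1.0], [10, 50, 100])
print(len(grid))  # 27
```

Trimming each list to the values that matter most is therefore the simplest way to shorten a run.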
