This repository has been archived by the owner on Oct 21, 2023. It is now read-only.

Document text classifier training using luigi



dctool2 is a collection of luigi tasks that train a document text classifier.


Create and activate a virtualenv

virtualenv --python=python3 virtualenv
source virtualenv/bin/activate

Download and install dctool2

python install


dctool2 requires labeled documents stored in a file in an HDFS folder. Each line of that file must contain exactly one JSON-encoded object, representing a single document. Each object must have the following schema.

    {
        "text": "the document content",
        "category": "the document category"
    }
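
As a rough sketch of how such a file could be produced, the snippet below writes labeled documents in the one-object-per-line layout described above and reads them back as a sanity check. The helper names `write_documents` and `read_documents` are illustrative assumptions and not part of dctool2, and the resulting file would still need to be copied to HDFS (for example with `hdfs dfs -put`).

```python
import json


def write_documents(documents, path):
    """Write one JSON object per line, each with "text" and "category" keys."""
    with open(path, "w", encoding="utf-8") as handle:
        for text, category in documents:
            # json.dumps escapes embedded newlines, so each document stays on one line.
            handle.write(json.dumps({"text": text, "category": category}) + "\n")


def read_documents(path):
    """Read the file back, checking that every line has the required keys."""
    documents = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            missing = {"text", "category"} - set(record)
            if missing:
                raise ValueError("missing keys: %r" % sorted(missing))
            documents.append((record["text"], record["category"]))
    return documents
```

For instance, `write_documents([("Match report ...", "sports")], "documents.json")` would produce a one-line file; the file name and the sample category are made up for illustration.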

Start the luigi scheduler

luigid --pidfile /path/to/pid/file --logdir /path/to/logs --state-path /path/to/state/file

Run the luigi tasks. The CreateClassifier task will perform a grid search to find the parameters that give the best classification result.

The following parameters must be given in the luigi.cfg file:

| variable | description |
| --- | --- |
| documents-file | the HDFS path to the training documents |
| output-folder | the path to store the results |
| categories | what categories to use in the classifier |
| test-size | the test set size |
| min-df-list | the term minimum document frequency |
| max-df-list | the term maximum document frequency |
| percentile-list | the percentile of features to keep |
| namenode-host | the Hadoop namenode address |
| namenode-port | the Hadoop namenode port |
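
To give a sense of what the grid search covers, the sketch below enumerates every combination of the three list-valued parameters. The candidate values are made up for illustration, and the idea that the search visits the full Cartesian product of the lists is an inference from the parameter names, not something confirmed by the dctool2 source.

```python
from itertools import product

# Hypothetical candidate values; the real ones come from luigi.cfg.
min_df_list = [1, 2, 5]
max_df_list = [0.5, 0.8, 1.0]
percentile_list = [10, 50, 100]


def parameter_grid(min_dfs, max_dfs, percentiles):
    """Return every (min_df, max_df, percentile) combination as a dict."""
    return [
        {"min_df": min_df, "max_df": max_df, "percentile": percentile}
        for min_df, max_df, percentile in product(min_dfs, max_dfs, percentiles)
    ]


grid = parameter_grid(min_df_list, max_df_list, percentile_list)
print(len(grid))  # 27 combinations for three values per list
```

Under that assumption the number of candidates is the product of the list lengths, so adding a value to any one list multiplies the amount of training work.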

Start the task with the following command

luigi --module dctool2.categories.tasks CreateClassifier --workers 4 

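The trained classifier will be in the <output-folder>/trained_classifier/classifier.pickle file. Use scikit-learn's sklearn.externals.joblib module to load it (on scikit-learn 0.23 and later, where that module was removed, use the standalone joblib package instead).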
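
A minimal loading sketch follows, assuming the pickle has first been copied to a local folder and that the installed scikit-learn version is compatible with the one used for training. The helper names `classifier_path` and `load_classifier` are hypothetical and not part of dctool2; the function tries the bundled joblib module first and falls back to the standalone `joblib` package when the bundled one is unavailable.

```python
import os


def classifier_path(output_folder):
    """Build the path to the trained classifier inside the output folder."""
    return os.path.join(output_folder, "trained_classifier", "classifier.pickle")


def load_classifier(output_folder):
    """Load the trained classifier from a local copy of the output folder."""
    try:
        # Older scikit-learn releases bundle joblib under sklearn.externals.
        from sklearn.externals import joblib
    except ImportError:
        # Newer releases dropped the bundled copy; use the standalone package.
        import joblib
    return joblib.load(classifier_path(output_folder))
```

Calling `load_classifier("./results")` (the folder name is an assumption) would return the unpickled object. How that object should be used, for example whether its `predict` method accepts raw text, depends on what CreateClassifier stores and is not documented here.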

The classifier evaluation will be stored in the <output-folder>/analysis folder.

Keep in mind that training can take a long time. On a laptop with an i3-3217U CPU and 8 GB of RAM, training a classifier on a 2000-document dataset with several different parameters took about an hour.


