JaTeCS (Java Text Categorization System)

JaTeCS is an open source Java library focused on Automatic Text Categorization (ATC). It covers all the steps of an experimental activity, from reading the corpus to evaluating the experimental results. JaTeCS focuses on text as the central input, and its code is optimized for this type of data. As with many other machine learning (ML) frameworks, it provides data readers for many formats and well-known corpora, NLP tools, feature selection and weighting methods, and implementations of many ML algorithms, as well as wrappers for well-known external software (e.g., libSVM, SVM_light). JaTeCS also provides implementations of methods related to ATC that are rarely, if ever, provided by other ML frameworks (e.g., active learning, quantification, transfer learning).

The software is released under the terms of the GPL license.

Software installation

To use the latest release of JaTeCS in your Maven projects, add the following to your project's POM:

<repositories>

    <repository>
        <id>jatecs-mvn-repo</id>
        <url>https://github.com/jatecs/jatecs/raw/mvn-repo/</url>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
        </snapshots>
    </repository>

</repositories>

Then add the following to the dependencies list:

<dependencies>

    <dependency>
        <groupId>hlt.isti.cnr.it</groupId>
        <artifactId>jatecs-gpl</artifactId>
        <version>1.0.0</version>
    </dependency>
    
</dependencies>

How to develop your apps with the software

Data representation through IIndex data structure

In JaTeCS, the raw textual data is manipulated through the use of an indexed structure named IIndex. This data structure handles all relations among documents, features, and categories (which could be defined in a taxonomy). The IIndex can be used to manipulate or query data.

The following snippet shows a very simple example printing the number of terms that appear more than 5 times in each document:


	for(int docID : index.getDocumentDB().getDocuments()) {
		String documentName = index.getDocumentDB().getDocumentName(docID);                
		int frequentTerms = 0;
		for (int featID : index.getContentDB().getDocumentFeatures(docID)) {
			if (index.getContentDB().getDocumentFeatureFrequency(docID, featID) > 5)
				frequentTerms++;
		}
		System.out.println("Document "+documentName+" contains " + frequentTerms + " frequent terms");
	}	

A richer example on the use of the IIndex structure can be found in IndexQuery.java.

The class TroveMainIndexBuilder.java is meant to create the IIndex. It can be used on its own, or in combination with CorpusReader.java and FullIndexConstructor.java to construct an index from a raw corpus of documents. The dataset directory contains many examples of corpus indexing, including the Reuters21578 collection (file IndexReuters21578.java) and the RCV1-v2 collection (file IndexRCV1.java), among many others. JaTeCS provides common feature extractors to represent features from raw textual data, such as the BoW (bag-of-words) extractor and the character n-grams extractor. Both extractors are subclasses of the generic class FeatureExtractor.java, which provides additional capabilities such as stemming and stopword removal.

Preparing data for experiments: feature selection and feature weighting

Once the index has been created and instantiated, a common step in the experimentation pipeline consists of selecting the most informative features (and discarding the rest). JaTeCS provides several implementations of global (see GlobalTSR.java) or local (see LocalTSR.java) Term Selection Reduction (TSR) methods. JaTeCS also provides many implementations of popular TSR functions, including InformationGain, ChiSquare, and GainRatio, among many others. Additionally, the GlobalTSR can be set with different policies, such as sum, average, or max (subclasses of IGlobalTSRPolicy.java). JaTeCS also implements the RoundRobinTSR method, which selects the most important features for each category in a round robin manner. The following snippet illustrates how round robin feature selection with information gain is carried out in JaTeCS (see the full example here):


	RoundRobinTSR tsr = new RoundRobinTSR(new InformationGain());
	tsr.setNumberOfBestFeatures(5000);
	tsr.computeTSR(index);	
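
The round robin policy itself can be illustrated independently of JaTeCS. The following is a minimal sketch in plain Java, not the library's implementation: it assumes that a score for every (category, feature) pair has already been computed (e.g., by information gain), and the class name RoundRobinSketch and the method select are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of round robin feature selection (not JaTeCS code).
final class RoundRobinSketch {

    // scores[c][f] = assumed precomputed score of feature f for category c (higher is better).
    // Returns the selected feature ids in the order in which they were picked.
    static Set<Integer> select(double[][] scores, int numToSelect) {
        int numCategories = scores.length;
        int numFeatures = numCategories == 0 ? 0 : scores[0].length;

        // For each category, rank the feature ids by decreasing score.
        List<List<Integer>> rankings = new ArrayList<>();
        for (double[] categoryScores : scores) {
            List<Integer> ranking = new ArrayList<>();
            for (int f = 0; f < numFeatures; f++) ranking.add(f);
            ranking.sort(Comparator.comparingDouble((Integer f) -> -categoryScores[f]));
            rankings.add(ranking);
        }

        Set<Integer> selected = new LinkedHashSet<>();
        int[] next = new int[numCategories]; // next rank position to inspect per category
        int target = Math.min(numToSelect, numFeatures);
        while (selected.size() < target) {
            // Each category, in turn, contributes its best feature not yet selected.
            for (int c = 0; c < numCategories && selected.size() < target; c++) {
                List<Integer> ranking = rankings.get(c);
                while (next[c] < numFeatures && selected.contains(ranking.get(next[c]))) next[c]++;
                if (next[c] < numFeatures) selected.add(ranking.get(next[c]++));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        double[][] scores = { {0.9, 0.1, 0.5, 0.0}, {0.8, 0.7, 0.2, 0.0} };
        System.out.println(select(scores, 3));
    }
}
```

Taking turns across categories ensures that every category contributes some of its own best features, which a single global ranking does not guarantee for rare categories.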

The last step in data preparation consists of weighting the features so as to reflect the "relative importance" of terms in the documents. JaTeCS offers popular methods for this, including the well-known TfIdf and BM25. Generally, the weighting function (here exemplified by TfIdf) is to be applied as follows (see the complete example here):


	IWeighting weighting = new TfNormalizedIdf(trainIndex);
	IIndex weightedTrainingIndex = weighting.computeWeights(trainIndex);
	IIndex weightedTestIndex = weighting.computeWeights(testIndex);	
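
To make the weighting step concrete, here is a minimal plain-Java sketch of one common tf-idf variant (raw term frequency times log(N/df), followed by scaling to unit L2 norm). It is not the code behind TfNormalizedIdf, whose exact formula may differ; the class TfIdfSketch and its methods are hypothetical. As in the snippet above, the statistics are estimated on the training documents only and then reused to weight test documents.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of tf-idf weighting (not JaTeCS code).
final class TfIdfSketch {
    private final Map<String, Double> idf = new HashMap<>();

    // Estimates idf(t) = log(N / df(t)) from the training documents only.
    // Each document is a map from term to raw frequency.
    TfIdfSketch(List<Map<String, Integer>> trainingDocs) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (Map<String, Integer> doc : trainingDocs)
            for (String term : doc.keySet()) documentFrequency.merge(term, 1, Integer::sum);
        int n = trainingDocs.size();
        documentFrequency.forEach((term, df) -> idf.put(term, Math.log((double) n / df)));
    }

    // Weights one document (training or test) and scales the result to unit L2 norm.
    Map<String, Double> weigh(Map<String, Integer> doc) {
        Map<String, Double> weights = new HashMap<>();
        double squaredNorm = 0.0;
        for (Map.Entry<String, Integer> entry : doc.entrySet()) {
            Double termIdf = idf.get(entry.getKey());
            if (termIdf == null) continue; // term unseen in training: ignored in this sketch
            double w = entry.getValue() * termIdf;
            weights.put(entry.getKey(), w);
            squaredNorm += w * w;
        }
        if (squaredNorm > 0) {
            double norm = Math.sqrt(squaredNorm);
            weights.replaceAll((term, w) -> w / norm);
        }
        return weights;
    }
}
```

Estimating the idf factors on the training set alone keeps information about the test set from leaking into the document representation.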

Building the classifier

Building a classifier typically involves a two-step process: (i) model learning (ILearner), and (ii) document classification (IClassifier). JaTeCS implements several machine learning algorithms, including AdaBoost-MH, MP-Boost, KNN, logistic_regression, naive Bayes, and SVM, among many others (placed in the source directory classification).

The following code shows how SVMlib can be trained in JaTeCS (check LearnSVMlib.java for the full example, and the source directory classification for examples involving other learning algorithms):


		SvmLearner svmLearner = new SvmLearner();
		IClassifier svmClassifier = svmLearner.build(trainIndex);

Once trained, the model can be used to classify unseen documents. In JaTeCS this is done by running a classifier that is instantiated with the previously learned model and receives as argument an index containing all test documents to be classified (a full example is available in ClassifySVMlib.java):


		Classifier classifier = new Classifier(testIndex, svmClassifier);
		classifier.exec();

JaTeCS also supports the evaluation of results by means of the following classes: ClassificationComparer.java (simple flat evaluation) and HierarchicalClassificationComparer.java (evaluation for hierarchical taxonomies of codes); a full example involving both evaluation procedures can be found here. Evaluation is easily performed in JaTeCS in just a few lines of code, e.g.:


		ClassificationComparer flatComparer = new ClassificationComparer(classifier.getClassificationDB(), testIndex.getClassificationDB());
		ContingencyTableSet tableSet = flatComparer.evaluate();
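
As an illustration of what a flat evaluation computes for a single category, the sketch below fills a binary contingency table from predicted and true labels and derives precision, recall, and F1 from it. It is plain Java rather than the JaTeCS ContingencyTableSet API; the class ContingencySketch is hypothetical, and returning 1.0 when a denominator is zero is an assumed convention.

```java
// Hypothetical illustration of a binary contingency table (not JaTeCS code).
final class ContingencySketch {
    int tp, fp, fn, tn;

    // Compares predicted and true labels for one category, document by document.
    static ContingencySketch compare(boolean[] predicted, boolean[] truth) {
        if (predicted.length != truth.length)
            throw new IllegalArgumentException("label arrays must have the same length");
        ContingencySketch table = new ContingencySketch();
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] && truth[i]) table.tp++;
            else if (predicted[i]) table.fp++;
            else if (truth[i]) table.fn++;
            else table.tn++;
        }
        return table;
    }

    // Assumed convention: a measure with an empty denominator is reported as 1.0.
    double precision() { return tp + fp == 0 ? 1.0 : (double) tp / (tp + fp); }

    double recall() { return tp + fn == 0 ? 1.0 : (double) tp / (tp + fn); }

    double f1() {
        int denominator = 2 * tp + fp + fn;
        return denominator == 0 ? 1.0 : 2.0 * tp / denominator;
    }
}
```

For a multi-category problem, one such table per category can be averaged (macro-averaging) or summed cell by cell before computing the measures (micro-averaging).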

Applications of Text Classification

JaTeCS includes many ready-to-use applications. These are useful for users who are mainly interested in quickly running experiments on their own data, and also for practitioners who would rather develop their own algorithms and applications; the latter may find in the JaTeCS app implementations a good starting point for becoming familiar with the framework through examples. In what follows, we show some selected examples, while many others can be found here.

Text Quantification

Text Quantification is the problem of estimating the distribution of labels in a collection of unlabeled documents, when the distribution in the training set may substantially differ. Though quantification processes a dataset as a single entity, the classification of single documents is the building block on which many quantification methods are built. JaTeCS implements a number of classification-based quantification methods:

  • Classify and Count
  • Adjusted Classify and Count
  • Probabilistic Classify and Count
  • Probabilistic Adjusted Classify and Count

All of their implementations are independent from the underlying classification method that acts as a plug-in component, as shown in the following code from the library (see the complete examples):


	int folds = 50;
	IScalingFunction scaling = new LogisticFunction();
	// any other learner can be plugged in
	ILearner classificationLearner = new SvmLightLearner();

	// learns six different quantifiers on training data (train is an IIndex object)
	QuantificationLearner quantificationLearner = new QuantificationLearner(folds, classificationLearner, scaling);
	QuantifierPool pool = quantificationLearner.learn(train);

	// quantifies on test returning the six predictions (test is an IIndex object)
	Quantification[] quantifications = pool.quantify(test);
	// evaluates predictions against true quantifications
	QuantificationEvaluation.Report(quantifications,test);
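
The two simplest methods in the list above can be sketched in a few lines of plain Java. This is an illustration of the standard formulas, not the JaTeCS implementation: Classify and Count reports the fraction of documents classified as positive, and Adjusted Classify and Count corrects that estimate using the classifier's true positive rate (tpr) and false positive rate (fpr), which are typically estimated by cross-validation on the training set. The class QuantificationSketch is a hypothetical name.

```java
// Hypothetical illustration of two classification-based quantifiers (not JaTeCS code).
final class QuantificationSketch {

    // Classify and Count: estimated prevalence = fraction of documents predicted positive.
    static double classifyAndCount(boolean[] predictions) {
        if (predictions.length == 0)
            throw new IllegalArgumentException("at least one prediction is required");
        int positives = 0;
        for (boolean predictedPositive : predictions) if (predictedPositive) positives++;
        return (double) positives / predictions.length;
    }

    // Adjusted Classify and Count: (cc - fpr) / (tpr - fpr), clipped to [0, 1].
    static double adjustedClassifyAndCount(boolean[] predictions, double tpr, double fpr) {
        double cc = classifyAndCount(predictions);
        if (tpr - fpr <= 0) return cc; // correction undefined: fall back to the plain estimate
        double adjusted = (cc - fpr) / (tpr - fpr);
        return Math.max(0.0, Math.min(1.0, adjusted));
    }
}
```

For example, with 4 positive predictions out of 10, tpr = 0.8, and fpr = 0.2, the plain estimate is 0.4 and the adjusted estimate is (0.4 - 0.2) / (0.8 - 0.2), i.e. about 0.33.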

Transfer Learning

Transfer Learning is concerned with leveraging the supervised information available for a source domain of knowledge in order to deploy a model that behaves well on a target domain (for which little or no labelled information exists), thus reducing, or completely avoiding, the need for human labelling effort in the target domain. In the context of ATC two scenarios are possible: (i) cross-domain TC, where the source and target documents deal with different topics (e.g., book reviews vs music reviews), and (ii) cross-lingual TC, in which the source and target documents are written in different languages (e.g., English vs German book reviews). JaTeCS includes an implementation of Distributional Correspondence Indexing (DCI), a feature-representation-transfer method for cross-domain and cross-lingual classification, described here. The class DCImain.java offers a complete implementation of the method, from the reading of source and target collections to the evaluation of results.

Active Learning, Training Data Cleaning, and Semi Automated Text Classification

JaTeCS provides implementations of a rich number of methods proposed for three interrelated classes of problems: Active Learning (AL), where the learning algorithm is prompted to select which documents to add to the training set at each step, with the aim of minimizing the amount of human labeling needed to obtain high accuracy; Training Data Cleaning (TDC), which consists of using learning algorithms to discover labeling errors in an already existing training set; and Semi-Automated Text Classification (SATC), which aims at reducing the amount of effort a human should invest while inspecting, and eventually repairing, the outcomes produced by a classifier in order to guarantee a required accuracy level.
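
As a rough illustration of the selection step in Active Learning, the sketch below implements uncertainty sampling for a binary classifier, one common strategy among many and not necessarily one of those shipped with JaTeCS. It assumes the classifier outputs a posterior probability for each unlabeled document; the class UncertaintySamplingSketch and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of uncertainty sampling (not JaTeCS code).
final class UncertaintySamplingSketch {

    // posteriors[i] = assumed probability that unlabeled document i is positive.
    // Returns the ids of the documents the classifier is least certain about.
    static List<Integer> selectForLabeling(double[] posteriors, int batchSize) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < posteriors.length; i++) ids.add(i);
        // The closer a posterior is to 0.5, the more uncertain the classifier is.
        ids.sort(Comparator.comparingDouble((Integer i) -> Math.abs(posteriors[i] - 0.5)));
        return new ArrayList<>(ids.subList(0, Math.min(batchSize, ids.size())));
    }

    public static void main(String[] args) {
        double[] posteriors = { 0.95, 0.52, 0.10, 0.45, 0.70 };
        System.out.println(selectForLabeling(posteriors, 2));
    }
}
```

The selected documents would then be labeled by a human, added to the training set, and the classifier retrained before the next round.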

Distributional Semantic Models

ATC often relies on a BoW model to represent a document collection, according to which each document can be thought of as a row in a matrix where each column informs about the frequency (see IContentDB), or the relative importance (see IWeightingDB), of each distinct feature (usually terms or n-grams) to that document, disregarding word order. Other representation mechanisms have been proposed as alternatives, including the distributional semantic models (DSM), which typically project the original BoW model into a reduced space, where semantics between terms is somehow modelled. JaTeCS implements a number of DSMs, covering some Random Projection methods (such as Random Indexing, or the Achlioptas mapping), and Latent Semantic Analysis (by wrapping the popular SVDLIBC implementation). Those methods are available as part of full applications, from reading the indexes to the evaluation of results.
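
The idea of projecting a BoW vector into a reduced space can be sketched with the Achlioptas mapping, in which each entry of the projection matrix is sqrt(3/k) times +1, 0, or -1, drawn with probabilities 1/6, 2/3, and 1/6. The code below is a plain-Java illustration on dense arrays, not the JaTeCS implementation; the class AchlioptasSketch, its methods, and the seed parameter are assumptions made for the example.

```java
import java.util.Random;

// Hypothetical illustration of the Achlioptas random projection (not JaTeCS code).
final class AchlioptasSketch {

    // Builds a k x d matrix with entries sqrt(3/k) * {+1, 0, -1},
    // drawn with probabilities {1/6, 2/3, 1/6}.
    static double[][] randomMatrix(int k, int d, long seed) {
        Random random = new Random(seed);
        double scale = Math.sqrt(3.0 / k);
        double[][] matrix = new double[k][d];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < d; j++) {
                int draw = random.nextInt(6); // uniform over 0..5
                matrix[i][j] = draw == 0 ? scale : (draw == 1 ? -scale : 0.0);
            }
        }
        return matrix;
    }

    // Projects a d-dimensional document vector into the k-dimensional reduced space.
    static double[] project(double[][] matrix, double[] document) {
        double[] projected = new double[matrix.length];
        for (int i = 0; i < matrix.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < document.length; j++) sum += matrix[i][j] * document[j];
            projected[i] = sum;
        }
        return projected;
    }
}
```

Because two thirds of the entries are zero, the projection is cheap to compute, and distances between documents are approximately preserved in expectation when k is large enough.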

Imbalanced Text Classification

The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is an often used strategy to counter this problem. JaTeCS provides a number of SMOTE-based implementations, including the original SMOTE approach, BorderSMOTE, and SMOTE-ENN. JaTeCS also provides an implementation of the recently proposed Distributional Random Oversampling (DRO), an oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds. Full applications using these methods are also provided in JaTeCS.
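
The core step of the original SMOTE approach can be sketched as follows: pick a minority example, pick one of its k nearest minority neighbours, and create a synthetic example at a random point on the segment joining the two. This is a plain-Java illustration on dense vectors with Euclidean distance, not the JaTeCS implementation; the class SmoteSketch, its methods, and the seed parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Hypothetical illustration of SMOTE-style oversampling (not JaTeCS code).
final class SmoteSketch {

    // Generates numSynthetic new minority examples by interpolating between
    // a randomly chosen minority example and one of its k nearest minority neighbours.
    static double[][] oversample(double[][] minority, int k, int numSynthetic, long seed) {
        if (minority.length < 2 || k < 1)
            throw new IllegalArgumentException("need at least two minority examples and k >= 1");
        Random random = new Random(seed);
        int neighbours = Math.min(k, minority.length - 1);
        double[][] synthetic = new double[numSynthetic][];
        for (int s = 0; s < numSynthetic; s++) {
            int i = random.nextInt(minority.length);
            int j = nearest(minority, i, neighbours)[random.nextInt(neighbours)];
            double u = random.nextDouble(); // position along the segment, in [0, 1)
            double[] point = new double[minority[i].length];
            for (int d = 0; d < point.length; d++)
                point[d] = minority[i][d] + u * (minority[j][d] - minority[i][d]);
            synthetic[s] = point;
        }
        return synthetic;
    }

    // Indices of the k examples closest to data[i], excluding i itself.
    static int[] nearest(double[][] data, int i, int k) {
        Integer[] others = new Integer[data.length - 1];
        for (int j = 0, p = 0; j < data.length; j++) if (j != i) others[p++] = j;
        Arrays.sort(others, Comparator.comparingDouble((Integer j) -> squaredDistance(data[i], data[j])));
        int[] result = new int[k];
        for (int p = 0; p < k; p++) result[p] = others[p];
        return result;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return sum;
    }
}
```

The synthetic examples would be added to the training set as extra positives before training the classifier.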
