ID3.java

Introduction

The Iterative Dichotomiser 3 Algorithm (ID3) is used in supervised machine learning problems. It operates by constructing a Decision Tree – which essentially splits the entire range of test cases into sub-cases depending on the value of certain attributes. The split is done on the basis of Information Gain.

Implementation Details

Node Definition

Essentially, each node in the decision tree contains the following details:

A list of training test cases.
The best attribute to split this node upon (if not leaf Node).
Whether it is a terminal or leaf Node: i.e. all testcases of this node have the same output class label.
The majority output class label – i.e. the output class label that occurs most frequently in the test cases of the node.

Pruning

Pruning aims to reduce the size and complexity of the tree and hence avoid overfitting. The library can implement two types of pruning – Pre-Pruning and Post-Pruning. By empirically testing with a given dataset, it can be determined which pruning has to be used (both can also be used together).

Pre-Pruning is implemented while the tree is being constructed (and hence the name). Essentially it uses the Chi-squared test (with significance value = 0.05) to check if a split (even though has the highest information gain) is significant. If yes, proceed with the algorithm. If not, then that node is not split further and it is labeled with the majority output label of its test cases.
Pessimistic Post-pruning is used after the tree is constructed. It traverses through each node of the tree and checks if the pessimistic error rate after the split is greater than the pessimistic error rate before the split. If so, the split is not statistically warranted and hence that node’s subtree is pruned.

Handling Continuous Variables

Certain attributes in the provided datasets can be continuous variables. The library handles continuous attributes by discretizing them into “n”-equal sized bins, depending on the actual continuous attribute value.

The user has an option to specify “n”. For instance consider the list 1.0, 1.2, 1.3, 1.5 and n=2. In this case the attribute value list will be transformed to a, a, b, b where a and b are the two discretized attribute values. So there are two bins here each of size 2.

Note: Discrete values mean “discrete independent values”. So even though (by actual ordering of the original list) a < b, it is assumed that a and b are two discrete independent values.

Handling missing attribute values

In case of missing attribute values; the library employs the following techniques depending on if the attribute is continuous or discrete.

If the attribute is continuous: The missing value is assigned the mean value of all the non-missing values of that attribute. So essentially if you have values of “1.0, 1.5, 2.0, ?” for four test instances of a particular attribute – the “?” is replaced with the average of 1.0, 1.5 and 2.0, i.e. 1.5.
If the attribute is discrete: The missing value is assigned the most commonly occurring attribute value for all the non-missing values of that attribute. So essentially if you have values of “c, a, b, b, b, ?” the missing value is replaced with “b”.

Functions Summary

Constructor Summary

public ID3 (String fName, int numAttributes, int testCases, String delimiter, int limitSplits)

Creates a new ID3 learner with the given parameters. 

Parameters
      -fName: File containing the training data.
      -numAttributes: number of attributes in the training data (not counting the output)
      -testCases: number of test cases to be considered from the training file. 
      -delimiter: delimiter between the attributes in a test case.
      -limitSplits: number of splits for continuous valued attributes 

Throws
      -IOException: Failed or interrupted while reading from file. 
      -FileNotFoundException: File fname is not found.

Method Summary

public float findAccuracy (String fTestName, int validationSize, String delim)

Find accuracy of learner on the given test set. (Can be called only after training). 

Parameters
      -fTestName: name of the file containing the test data. 
      -validationSize: number of test data instances. 
      -delim: delimiter to seperate attributes in a single instance of the test data. 

Throws
      -IOException: Failed or interrupted while reading from file. 
      -FileNotFoundException: File fTestName is not found.

public int returnNumNodes ()

Returns the number of nodes in the constructed decision tree.

Parameters
      None

Throws
      None

public void drawDecisionTree (String outFile, int NodeSize)

Draw the decision tree to the given output file (.png). 

Parameters
     -outFile: Output file to which the decision tree is drawn.
     -NodeSize: Diameter of a single node in pixels. 

Throws
     None

public void postPruneTree ()

Used to post-prune a constructed decision tree. (Can be called only after training). 

Parameters
    None

Throws
    None

public void createDecisionTree (boolean isPrePruning)

Used to construct the decision tree. This is actual training phase. 

Parameters
   isPrePruning: Indicates if pre-pruning has to be done while constructing the tree.

Throws
   None

Provide feedback

Saved searches

Use saved searches to filter your results more quickly