Recursive text classifier that builds and updates a decision tree to label documents based on extracted features.

kkirke/classifier

Text Classifier (Decision Tree)

Overview

This project implements a text classifier in Java using a decision-tree representation. The classifier is trained on labeled text data loaded from CSV files, builds a tree using word-frequency features and numeric thresholds, and can classify new text inputs by traversing the learned tree.

The classifier also supports saving and reloading trained models using a preorder tree format.

This project was completed as part of CSE 123 (Computer Programming) at the University of Washington.


Features

  • Decision-tree classifier with recursive traversal
  • Word-frequency feature extraction
  • Threshold-based branching decisions
  • CSV-based training and testing data loading
  • Save and reload trained classifier trees
  • Accuracy evaluation on test datasets

Key Concepts Used

  • Binary trees
  • Recursion
  • Object-oriented design
  • File I/O
  • Feature extraction from text
  • Data-driven algorithm design

Project Structure

  • Classifier.java // Core classifier logic and tree operations
  • TextBlock.java // Represents labeled text and word-frequency data
  • DataLoader.java // Loads training and testing data
  • CsvReader.java // Parses CSV files into usable records
  • Client.java // Runs training, testing, and classification

How It Works

Text Representation

  • Each document is represented as a TextBlock.
  • A TextBlock stores:
    • A label (classification category)
    • A mapping of words to occurrence counts
    • The total number of words in the document
  • The probability of a word is its relative frequency within the document: count(word) / totalWords
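
As a rough illustration of the representation described above, here is a minimal sketch of a TextBlock-like class. The class name `TextBlockSketch`, the method `probabilityOf`, and the tokenization rule (lowercase, split on non-letters) are assumptions made for this example and may not match the actual TextBlock.java.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TextBlock; the real class may differ.
class TextBlockSketch {
    private final String label;
    private final Map<String, Integer> counts = new HashMap<>();
    private int totalWords = 0;

    TextBlockSketch(String label, String text) {
        this.label = label;
        // Assumed tokenization: lowercase, then split on runs of non-letters.
        for (String word : text.toLowerCase().split("[^a-z]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
                totalWords++;
            }
        }
    }

    String getLabel() {
        return label;
    }

    // count(word) / totalWords, or 0.0 for an empty document.
    double probabilityOf(String word) {
        if (totalWords == 0) {
            return 0.0;
        }
        return counts.getOrDefault(word.toLowerCase(), 0) / (double) totalWords;
    }
}
```

For the text "Free money, free prizes!" this sketch counts four words, so "free" has probability 2 / 4 = 0.5 and a word that never appears has probability 0.0.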

Training the Classifier

  • The classifier is trained on a list of labeled TextBlock objects.
  • Internal decision nodes store:
    • A feature word
    • A numeric threshold
  • During training, a misclassified example causes the tree to grow: a new decision node is introduced that separates examples by comparing a word's probability against the node's threshold.
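
The growth step described above could look roughly like the sketch below. It is not taken from Classifier.java. The feature choice (the word whose probability differs most between the two examples), the midpoint threshold, the "below threshold goes left" convention, and all names (`TrainSketch`, `Node`, `train`, `split`) are assumptions for illustration. Each example is reduced to a map from word to probability.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TrainSketch {
    // Hypothetical node: a leaf has a label, a decision node has a feature.
    static class Node {
        String label;                 // non-null only for leaves
        Map<String, Double> example;  // example that created this leaf
        String feature;
        double threshold;
        Node left, right;

        Node(String label, Map<String, Double> example) {
            this.label = label;
            this.example = example;
        }

        Node(String feature, double threshold, Node left, Node right) {
            this.feature = feature;
            this.threshold = threshold;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return label != null;
        }
    }

    // Returns the (possibly new) root of the subtree after seeing one example.
    static Node train(Node node, String label, Map<String, Double> probs) {
        if (node == null) {
            return new Node(label, probs);
        }
        if (node.isLeaf()) {
            if (node.label.equals(label)) {
                return node; // already classified correctly
            }
            return split(node, new Node(label, probs)); // misclassified: grow
        }
        if (probs.getOrDefault(node.feature, 0.0) < node.threshold) {
            node.left = train(node.left, label, probs);
        } else {
            node.right = train(node.right, label, probs);
        }
        return node;
    }

    // Assumed heuristic: split on the word with the largest probability gap.
    static Node split(Node old, Node fresh) {
        Set<String> words = new HashSet<>(old.example.keySet());
        words.addAll(fresh.example.keySet());
        String best = null;
        double bestDiff = -1.0;
        for (String w : words) {
            double diff = Math.abs(old.example.getOrDefault(w, 0.0)
                    - fresh.example.getOrDefault(w, 0.0));
            if (diff > bestDiff) {
                bestDiff = diff;
                best = w;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("both examples are empty");
        }
        double a = old.example.getOrDefault(best, 0.0);
        double b = fresh.example.getOrDefault(best, 0.0);
        double threshold = (a + b) / 2.0;
        // The example with the lower probability ends up on the left.
        return a < b
                ? new Node(best, threshold, old, fresh)
                : new Node(best, threshold, fresh, old);
    }
}
```

One limitation of this sketch: two examples with identical word probabilities but different labels cannot be separated by any threshold, so a real implementation would need a policy for that case.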

Classification

  • To classify new text, the classifier:
  1. Starts at the root of the tree
  2. Compares the input's probability for the node's feature word against the node's threshold
  3. Recurses into the left or right subtree depending on the result
  4. Returns the label stored at the leaf node
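
The traversal steps above can be sketched as a short recursive method. The `Node` shape, the names `ClassifySketch` and `classify`, and the convention that probabilities below the threshold go left are assumptions for this example, not details confirmed by the project. The input is represented as a map from word to probability.

```java
import java.util.Map;

class ClassifySketch {
    // Hypothetical node: a leaf stores a label, a decision node stores
    // a feature word, a threshold, and two children.
    static class Node {
        String label;
        String feature;
        double threshold;
        Node left, right;

        Node(String label) {
            this.label = label;
        }

        Node(String feature, double threshold, Node left, Node right) {
            this.feature = feature;
            this.threshold = threshold;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return label != null;
        }
    }

    static String classify(Node node, Map<String, Double> probs) {
        if (node.isLeaf()) {
            return node.label; // base case: the leaf holds the answer
        }
        // Words missing from the input count as probability 0.0.
        double p = probs.getOrDefault(node.feature, 0.0);
        // Assumed convention: below the threshold goes left, otherwise right.
        return p < node.threshold
                ? classify(node.left, probs)
                : classify(node.right, probs);
    }
}
```

With a single decision node on "free" at threshold 0.1, an input where "free" has probability 0.5 reaches the right leaf, while an input that never mentions "free" reaches the left leaf.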

Saving and Loading Models

  • The classifier can be saved to a file using a preorder traversal format.
  • Each node is written as either:
    • A feature/threshold pair (branch node), or
    • A label (leaf node)
  • The saved file can later be reloaded to reconstruct the exact same tree structure.
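
To show why a preorder format is enough to rebuild the tree, here is a minimal save/load sketch. The on-disk layout used here (a "Feature: " line followed by a "Threshold: " line for branch nodes, and a bare label line for leaves) is an assumption for illustration, as are the names `PersistSketch`, `save`, `load`, and `toText`. The project's real file format may differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.Scanner;

class PersistSketch {
    static class Node {
        String label;
        String feature;
        double threshold;
        Node left, right;

        Node(String label) {
            this.label = label;
        }

        Node(String feature, double threshold, Node left, Node right) {
            this.feature = feature;
            this.threshold = threshold;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return label != null;
        }
    }

    // Preorder: write the node itself, then its left and right subtrees.
    static void save(Node node, PrintStream out) {
        if (node.isLeaf()) {
            out.println(node.label);
            return;
        }
        out.println("Feature: " + node.feature);
        out.println("Threshold: " + node.threshold);
        save(node.left, out);
        save(node.right, out);
    }

    // Reads nodes in the same order they were written.
    static Node load(Scanner in) {
        String line = in.nextLine();
        if (!line.startsWith("Feature: ")) {
            return new Node(line); // any other line is treated as a leaf label
        }
        String feature = line.substring("Feature: ".length());
        String thresholdLine = in.nextLine();
        double threshold = Double.parseDouble(
                thresholdLine.substring("Threshold: ".length()));
        Node left = load(in);
        Node right = load(in);
        return new Node(feature, threshold, left, right);
    }

    // Helper for the example: serializes a tree to a String.
    static String toText(Node root) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        save(root, new PrintStream(buffer, true));
        return buffer.toString();
    }
}
```

Because every branch node is followed by exactly two subtrees, the reader always knows where one subtree ends and the next begins without storing any extra structure. In this sketch a label that itself starts with "Feature: " would be misread as a branch node.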

Example Workflow

  1. Load training data from a CSV file
  2. Train the classifier on labeled text examples
  3. Evaluate accuracy on a separate test dataset
  4. Save the trained classifier to disk
  5. Reload the classifier and classify new inputs
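
The evaluation step in the workflow above amounts to counting correct predictions. The sketch below assumes accuracy means the fraction of test inputs whose predicted label equals the expected label. The names `AccuracySketch` and `accuracy` are hypothetical, and the classifier is passed in as a plain function so the example stays independent of Classifier.java.

```java
import java.util.List;
import java.util.function.Function;

class AccuracySketch {
    // Fraction of inputs whose predicted label equals the expected label.
    static <T> double accuracy(Function<T, String> classifier,
                               List<T> inputs,
                               List<String> expected) {
        if (inputs.size() != expected.size()) {
            throw new IllegalArgumentException("inputs and labels differ in size");
        }
        if (inputs.isEmpty()) {
            return 0.0; // assumed convention for an empty test set
        }
        int correct = 0;
        for (int i = 0; i < inputs.size(); i++) {
            if (classifier.apply(inputs.get(i)).equals(expected.get(i))) {
                correct++;
            }
        }
        return (double) correct / inputs.size();
    }
}
```

For instance, a classifier that gets two of three test examples right would score 2 / 3, or about 0.667.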

What I Learned

  • How decision trees encode conditional logic
  • How recursion simplifies tree traversal and reconstruction
  • How feature selection and thresholds affect classification behavior
  • How to design programs that separate data loading, modeling, and execution

Future Improvements

  • Support multi-feature splits
  • Improve feature selection heuristics
  • Visualize the decision tree structure
  • Add probabilistic confidence scores to classifications

Notes

This project was completed as coursework. All implementation and design decisions are my own.
