This project implements a text classifier in Java using a decision-tree representation. The classifier is trained on labeled text data loaded from CSV files, builds a tree using word-frequency features and numeric thresholds, and can classify new text inputs by traversing the learned tree.
The classifier also supports saving and reloading trained models using a preorder tree format.
This project was completed as part of CSE 123 (Computer Programming) at the University of Washington.
- Decision-tree classifier with recursive traversal
- Word-frequency feature extraction
- Threshold-based branching decisions
- CSV-based training and testing data loading
- Save and reload trained classifier trees
- Accuracy evaluation on test datasets
- Binary trees
- Recursion
- Object-oriented design
- File I/O
- Feature extraction from text
- Data-driven algorithm design
- Classifier.java // Core classifier logic and tree operations
- TextBlock.java // Represents labeled text and word-frequency data
- DataLoader.java // Loads training and testing data
- CsvReader.java // Parses CSV files into usable records
- Client.java // Runs training, testing, and classification
- Each document is represented as a `TextBlock`.
- A `TextBlock` stores:
  - A label (classification category)
  - A mapping of words to occurrence counts
  - The total number of words in the document
- Word probabilities are computed as: count(word) / totalWords
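The word-probability formula above can be sketched as follows. This is a minimal illustration, not the project's actual `TextBlock`: the class name `TextBlockSketch`, its method names, and the tokenization rule (lowercase, split on whitespace) are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TextBlock; names and tokenization are assumptions.
class TextBlockSketch {
    private final String label;
    private final Map<String, Integer> counts = new HashMap<>();
    private int totalWords = 0;

    TextBlockSketch(String label, String text) {
        this.label = label;
        // Assumption: words are lowercased and separated by whitespace.
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
                totalWords++;
            }
        }
    }

    String getLabel() {
        return label;
    }

    // count(word) / totalWords; 0.0 for an unseen word or an empty document.
    double probability(String word) {
        if (totalWords == 0) {
            return 0.0;
        }
        return counts.getOrDefault(word.toLowerCase(), 0) / (double) totalWords;
    }

    public static void main(String[] args) {
        TextBlockSketch block = new TextBlockSketch("spam", "win cash now win big");
        // "win" appears 2 times out of 5 words.
        System.out.println(block.probability("win"));
    }
}
```

Returning 0.0 for unseen words means a document can always be compared against any feature word, even one it never contains.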
- The classifier is trained on a list of labeled `TextBlock` objects.
- Internal decision nodes store:
  - A feature word
  - A numeric threshold
- During training, examples that are misclassified cause the tree to grow by introducing a new decision node that separates examples based on whether the word probability meets the threshold.
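One way the growth step described above could look is sketched below. Several details are assumptions rather than statements about this project's `Classifier.java`: documents are passed in as word-to-probability maps, each leaf remembers the example that created it, the split word is the one whose probability differs most between the two examples, the threshold is the midpoint of the two probabilities, and values below the threshold go left. All names are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TreeGrowthSketch {
    static class Node {
        final String feature;              // null for a leaf
        final double threshold;
        final String label;                // null for a decision node
        final Map<String, Double> example; // probabilities of the example that made this leaf
        Node left, right;

        Node(String label, Map<String, Double> example) {
            this.feature = null;
            this.threshold = 0.0;
            this.label = label;
            this.example = example;
        }

        Node(String feature, double threshold, Node left, Node right) {
            this.feature = feature;
            this.threshold = threshold;
            this.label = null;
            this.example = null;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return feature == null;
        }
    }

    static double p(Map<String, Double> probs, String word) {
        return probs.getOrDefault(word, 0.0);
    }

    // Returns the (possibly new) root of the subtree after seeing one labeled example.
    static Node insert(Node node, Map<String, Double> probs, String label) {
        if (node == null) {
            return new Node(label, probs);
        }
        if (!node.isLeaf()) {
            // Assumption: probability < threshold goes left, otherwise right.
            if (p(probs, node.feature) < node.threshold) {
                node.left = insert(node.left, probs, label);
            } else {
                node.right = insert(node.right, probs, label);
            }
            return node;
        }
        if (node.label.equals(label)) {
            return node; // already classified correctly, no growth needed
        }
        // Misclassified: pick the word with the largest probability difference.
        Set<String> words = new HashSet<>(probs.keySet());
        words.addAll(node.example.keySet());
        String best = null;
        double bestDiff = 0.0;
        for (String word : words) {
            double diff = Math.abs(p(probs, word) - p(node.example, word));
            if (diff > bestDiff) {
                bestDiff = diff;
                best = word;
            }
        }
        if (best == null) {
            return node; // identical word probabilities, cannot be separated
        }
        double threshold = (p(probs, best) + p(node.example, best)) / 2.0;
        Node fresh = new Node(label, probs);
        return p(probs, best) < threshold
                ? new Node(best, threshold, fresh, node)
                : new Node(best, threshold, node, fresh);
    }
}
```

Because the threshold is the midpoint of two different probabilities, the new example and the stored example always land on opposite sides of the new decision node.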
- To classify new text, the classifier:
  - Starts at the root of the tree
  - Evaluates the stored feature against its threshold
  - Recursively traverses left or right
  - Returns the label stored at the leaf node
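The traversal steps above can be sketched as a short recursive method. The `Node` class, the word-to-probability map used as input, and the convention that probabilities below the threshold go left are assumptions for illustration and may not match the project's code.

```java
import java.util.Map;

class ClassifySketch {
    // Hypothetical node: a leaf has a label, a decision node has a feature and threshold.
    static class Node {
        final String feature;   // null for a leaf
        final double threshold;
        final String label;     // null for a decision node
        final Node left, right;

        Node(String label) {
            this(null, 0.0, label, null, null);
        }

        Node(String feature, double threshold, Node left, Node right) {
            this(feature, threshold, null, left, right);
        }

        private Node(String feature, double threshold, String label, Node left, Node right) {
            this.feature = feature;
            this.threshold = threshold;
            this.label = label;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return feature == null;
        }
    }

    // Walks from the given node down to a leaf and returns that leaf's label.
    static String classify(Node node, Map<String, Double> probs) {
        if (node.isLeaf()) {
            return node.label;
        }
        double probability = probs.getOrDefault(node.feature, 0.0);
        // Assumption: below the threshold goes left, otherwise right.
        return probability < node.threshold
                ? classify(node.left, probs)
                : classify(node.right, probs);
    }

    public static void main(String[] args) {
        Node tree = new Node("cash", 0.25, new Node("ham"), new Node("spam"));
        System.out.println(classify(tree, Map.of("cash", 0.5)));    // spam
        System.out.println(classify(tree, Map.of("meeting", 0.5))); // ham
    }
}
```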
- The classifier can be saved to a file using a preorder traversal format.
- Each node is written as either:
  - A feature/threshold pair (branch node), or
  - A label (leaf node)
- The saved file can later be reloaded to reconstruct the exact same tree structure.
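A preorder save and reload can be sketched as below. The exact on-disk layout is an assumption: here a branch node is written as a `Feature: ` line followed by a `Threshold: ` line, and a leaf is written as a single line holding its label. The class and method names are hypothetical, and the example writes to a string instead of a file to stay self-contained.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Scanner;

class PreorderIoSketch {
    static class Node {
        final String feature;   // null for a leaf
        final double threshold;
        final String label;     // null for a decision node
        final Node left, right;

        Node(String label) {
            this.feature = null;
            this.threshold = 0.0;
            this.label = label;
            this.left = null;
            this.right = null;
        }

        Node(String feature, double threshold, Node left, Node right) {
            this.feature = feature;
            this.threshold = threshold;
            this.label = null;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return feature == null;
        }
    }

    // Preorder: write the node itself, then its left subtree, then its right subtree.
    static void save(Node node, PrintWriter out) {
        if (node.isLeaf()) {
            out.println(node.label);
        } else {
            out.println("Feature: " + node.feature);
            out.println("Threshold: " + node.threshold);
            save(node.left, out);
            save(node.right, out);
        }
    }

    // Mirrors save: the line prefix tells us whether to recurse or stop at a leaf.
    static Node load(Scanner in) {
        String line = in.nextLine();
        if (line.startsWith("Feature: ")) {
            String feature = line.substring("Feature: ".length());
            String thresholdLine = in.nextLine();
            double threshold = Double.parseDouble(thresholdLine.substring("Threshold: ".length()));
            Node left = load(in);
            Node right = load(in);
            return new Node(feature, threshold, left, right);
        }
        return new Node(line);
    }

    static String toText(Node root) {
        StringWriter buffer = new StringWriter();
        try (PrintWriter out = new PrintWriter(buffer)) {
            save(root, out);
        }
        return buffer.toString();
    }

    static Node fromText(String text) {
        try (Scanner in = new Scanner(text)) {
            return load(in);
        }
    }
}
```

Because a preorder listing records each branch before its children, the loader never needs explicit markers for where a subtree ends. One limitation of this assumed layout is that a label beginning with `Feature: ` would be misread as a branch node.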
- Load training data from a CSV file
- Train the classifier on labeled text examples
- Evaluate accuracy on a separate test dataset
- Save the trained classifier to disk
- Reload the classifier and classify new inputs
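The evaluation step in the workflow above amounts to counting how many test examples receive the expected label. The sketch below assumes the classifier can be treated as a function from text to label; the class `AccuracySketch`, its `accuracy` method, and the toy classifier in `main` are hypothetical and do not reflect the API of `Classifier.java` or `Client.java`.

```java
import java.util.List;
import java.util.function.Function;

class AccuracySketch {
    // Fraction of texts whose predicted label equals the expected label.
    static double accuracy(Function<String, String> classifier,
                           List<String> texts, List<String> labels) {
        if (texts.size() != labels.size()) {
            throw new IllegalArgumentException("texts and labels must have the same size");
        }
        if (texts.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < texts.size(); i++) {
            if (classifier.apply(texts.get(i)).equals(labels.get(i))) {
                correct++;
            }
        }
        return correct / (double) texts.size();
    }

    public static void main(String[] args) {
        // Toy classifier used only for this demonstration.
        Function<String, String> toy = text -> text.contains("cash") ? "spam" : "ham";
        List<String> texts = List.of("win cash now", "meeting at noon", "cash back offer", "free lunch");
        List<String> labels = List.of("spam", "ham", "spam", "spam");
        System.out.println(accuracy(toy, texts, labels)); // 3 of 4 correct
    }
}
```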
- How decision trees encode conditional logic
- How recursion simplifies tree traversal and reconstruction
- How feature selection and thresholds affect classification behavior
- How to design programs that separate data loading, modeling, and execution
- Support multi-feature splits
- Improve feature selection heuristics
- Visualize the decision tree structure
- Add probabilistic confidence scores to classifications
This project was completed as coursework. All implementation and design decisions are my own.