# A classifier for odd and even

This is weird, but it helps to emphasise the importance of data representation.

I asked ChatGPT to write me some code for a decision tree to classify odd and even numbers. It obliged, creating a dataset of integers labelled as odd and even. I upped the number of samples in the dataset to 10k, but otherwise used the dataset-generating code directly. It gives a pandas DataFrame with one column "number" and one column "label", where "label" is "Odd" or "Even". ChatGPT also gave me a fairly standard decision tree and said that it should get 100% accuracy because odd/even classification is a fairly easy task. But, of course, the decision tree failed utterly: a decision tree splits on thresholds of a feature, and parity flips with every step along the integers, so no threshold on the raw integer separates odd from even. It needs a binary representation.
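To see why integer thresholds struggle, here is a minimal pure-Python sketch. It is not the notebook's scikit-learn tree, and `best_stump_accuracy` is a name made up for illustration. It tries every single threshold split on the integers 0 to 99 and reports the best accuracy a one-split tree could reach.

```python
def best_stump_accuracy(numbers):
    """Best accuracy of any single threshold split n <= t on odd/even labels."""
    labels = [n % 2 for n in numbers]  # 1 = odd, 0 = even
    best = 0.0
    for t in numbers:
        left = [y for n, y in zip(numbers, labels) if n <= t]
        right = [y for n, y in zip(numbers, labels) if n > t]
        correct = 0
        for side in (left, right):
            if side:
                ones = sum(side)
                # Each side predicts its majority class
                correct += max(ones, len(side) - ones)
        best = max(best, correct / len(numbers))
    return best


numbers = list(range(100))
print(best_stump_accuracy(numbers))
```

Whatever the threshold, both sides hold a near-even mix of odd and even numbers, so a single split barely beats a coin flip. A deeper tree can carve out one interval per training number, but that is memorisation and probably says little about unseen numbers.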

So I gave it a binary representation. The weird part is that, with a binary representation passed as a single feature (a list or a string), the model gets a fairly consistent level of accuracy, but it's not 100%. It depends on the dataset split, and might depend on the depth of the decision tree too; I haven't tried that yet. But why? What's the maths, and is it complicated? The decision tree splits along "nicer" values: when you give it integers, the tree splits along fairly random values, but with a binary string they *seem* to be nicer.
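A possible explanation, offered as a hedged sketch rather than a confirmed diagnosis: if scikit-learn casts the string column to a number, then a string ending in "101" is read as the decimal value 101, not as the bits of 5. The scikit-learn documentation says tree inputs are converted to float32 internally, and float32 only holds about 7 significant decimal digits, so long binary strings read as decimals would lose their last digit. The sketch below simulates that with the standard library only. `as_float32` and `seen_by_tree` are helpers invented here, and the string-to-float step is an assumption about what happens inside the notebook's model.

```python
import struct

BIT_WIDTH = 32  # assumed to match bit_accuracy in the notebook


def binarise(x):
    """Zero-padded binary string, as in the notebook."""
    return bin(x)[2:].zfill(BIT_WIDTH)


def as_float32(value):
    """Round a Python float to the nearest float32 (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", value))[0]


def seen_by_tree(x):
    """Binary string -> decimal float -> float32 (the assumed conversion)."""
    return as_float32(float(binarise(x)))


# Small numbers keep their last digit
print(seen_by_tree(4), seen_by_tree(5))

# Larger numbers: an even number and the odd number after it collapse together
print(seen_by_tree(256) == seen_by_tree(257))
print(seen_by_tree(9998) == seen_by_tree(9999))
```

If this is what happens, the single-feature model cannot tell many odd/even neighbours apart at all, which would explain why the accuracy is stable but never 100%.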

Naturally, splitting the single binary string into one feature for every digit (character) in the string gives a classifier with perfect accuracy: it just classifies based on the least significant digit, as it should. But ChatGPT didn't know to do this out of the box.
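As a small sanity check of that claim, here is a pure-Python sketch (no scikit-learn) that builds one feature per binary digit and looks for digit positions that match the odd/even label on every row. `bit_features` is a name invented for this illustration, and the width of 32 is assumed to match `bit_accuracy`.

```python
BIT_WIDTH = 32  # assumed to match bit_accuracy in the notebook


def bit_features(n, width=BIT_WIDTH):
    """Return the binary digits of n as a list of ints, most significant first."""
    return [int(c) for c in format(n, f"0{width}b")]


numbers = range(10000)
labels = [n % 2 for n in numbers]  # 1 = odd, 0 = even
rows = [bit_features(n) for n in numbers]

# Find every digit position whose value equals the label on all rows
perfect_columns = [
    i for i in range(BIT_WIDTH)
    if all(row[i] == y for row, y in zip(rows, labels))
]
print(perfect_columns)
```

Only the last position (the least significant digit) matches the label everywhere, so a tree with one split on that column is enough.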

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import export_text


In [52]:
# Generate some odds and evens
odds = list(range(1, 10000, 2))   
evens = list(range(0, 10000, 2)) 

# Create DataFrames with labels
df_odds = pd.DataFrame({"number": odds, "label": "Odd"})
df_evens = pd.DataFrame({"number": evens, "label": "Even"})

# Combine them into one DataFrame
df = pd.concat([df_odds, df_evens], ignore_index=True)
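
# Width of the zero-padded binary representation
bit_accuracy = 32

def binarise(x):
    """Return x as a zero-padded binary string."""
    return bin(x)[2:].zfill(bit_accuracy)

def binarise_to_list(x):
    """Return the binary digits of x as a list of single-character strings."""
    return list(binarise(x))

# Series.apply already works element-wise, so np.vectorize is not needed
df["bin"] = df["number"].apply(binarise)
df["bin list"] = df["number"].apply(binarise_to_list)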

## Putting the binary values into columns
# Split strings into characters and expand into new columns
df_chars = df["bin list"].apply(list).apply(pd.Series)

# Rename columns to char_1 (most significant digit) ... char_N (least significant)
df_chars.columns = [f"char_{i+1}" for i in df_chars.columns]

# Append the per-digit columns to the original DataFrame
df = pd.concat([df, df_chars], axis=1)

print(df.head(10))


   number label                               bin  \
0       1   Odd  00000000000000000000000000000001   
1       3   Odd  00000000000000000000000000000011   
2       5   Odd  00000000000000000000000000000101   
3       7   Odd  00000000000000000000000000000111   
4       9   Odd  00000000000000000000000000001001   
5      11   Odd  00000000000000000000000000001011   
6      13   Odd  00000000000000000000000000001101   
7      15   Odd  00000000000000000000000000001111   
8      17   Odd  00000000000000000000000000010001   
9      19   Odd  00000000000000000000000000010011   

                                            bin list char_1 char_2 char_3  \
0  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...      0      0      0   
1  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...      0      0      0   
2  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...      0      0      0   
3  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...      0      0      0   
4  [0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [59]:
data_representation = "binary"

# Features (X) and target (y)
X = df[["number"]]
feature_names = ["number"]

if data_representation == "binary":
    # Single feature: the zero-padded binary string
    X = df[["bin"]]
    feature_names = ["bin"]
elif data_representation == "binary columns":
    # One feature per binary digit (the last bit_accuracy columns of df)
    X = df.iloc[:, -bit_accuracy:]
    feature_names = list(X.columns)


y = df["label"].map({"Odd": 1, "Even": 0})  # safer as numeric


# Split into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [62]:
# Train Decision Tree
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Visualise
tree_rules = export_text(clf, feature_names=feature_names)
print(tree_rules)


Accuracy: 0.31

Classification Report:
               precision    recall  f1-score   support

           0       0.34      0.40      0.37      1000
           1       0.27      0.22      0.24      1000

    accuracy                           0.31      2000
   macro avg       0.30      0.31      0.30      2000
weighted avg       0.30      0.31      0.30      2000

|--- bin <= 10055.50
|   |--- bin <= 0.50
|   |   |--- class: 0
|   |--- bin >  0.50
|   |   |--- bin <= 10000.50
|   |   |   |--- bin <= 5555.50
|   |   |   |   |--- bin <= 100.50
|   |   |   |   |   |--- bin <= 5.50
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- bin >  5.50
|   |   |   |   |   |   |--- bin <= 10.50
|   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |--- bin >  10.50
|   |   |   |   |   |   |   |--- bin <= 55.50
|   |   |   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |   |   |--- bin >  55.50
|   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |--- bin >  10
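
The thresholds in the printed tree look like binary strings read as decimal numbers. As a hedged sketch, assume the strings were cast to decimal floats and that the tree places each threshold halfway between two adjacent feature values. Under those assumptions, the midpoints between consecutive integers reproduce the values above. `decimal_reading` and `midpoint` are names invented for this illustration.

```python
def decimal_reading(x):
    """Read the binary digits of x as if they were a decimal number."""
    return float(bin(x)[2:])


def midpoint(a, b):
    """Halfway point between the decimal readings of a and b."""
    return (decimal_reading(a) + decimal_reading(b)) / 2


# Consecutive integers whose midpoints show up as thresholds in the tree
for a in (0, 1, 2, 3, 4, 15, 16, 19):
    print(a, a + 1, midpoint(a, a + 1))
```

For example, 19 and 20 are 10011 and 10100 in binary, and halfway between those two decimal readings is 10055.5, the threshold at the root of the tree.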

In [None]:
from sklearn.svm import SVC
# Train SVC
clf = SVC(kernel="linear", random_state=42)
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))