Sure, here's example code that builds a Decision Tree classifier in Python for the given dataset:

In [7]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

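# Load the dataset
df = pd.read_csv("repurchase_training.csv")

# Preview the rows that have no missing values. dropna() returns a new
# DataFrame, so df itself is unchanged and the later cells use every row.
df.dropna()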

Unnamed: 0,ID,Target,age_band,gender,car_model,car_segment,age_of_vehicle_years,sched_serv_warr,non_sched_serv_warr,sched_serv_paid,non_sched_serv_paid,total_paid_services,total_services,mth_since_last_serv,annualised_mileage,num_dealers_visited,num_serv_dealer_purchased
0,1,0,3. 35 to 44,Male,model_1,LCV,9,2,10,3,7,5,6,9,8,10,4
6,8,0,1. <25,Male,model_3,Large/SUV,8,2,8,2,9,9,4,7,6,10,4
34,42,0,2. 25 to 34,Female,model_2,Small/Medium,5,10,6,9,7,8,9,8,9,3,10
38,46,0,4. 45 to 54,Female,model_2,Small/Medium,7,8,2,8,2,5,6,6,9,7,7
51,61,0,2. 25 to 34,Female,model_7,LCV,6,4,4,4,6,5,4,8,10,7,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131110,153876,1,3. 35 to 44,Male,model_5,Large/SUV,1,1,2,3,1,2,2,4,6,7,9
131130,153898,1,2. 25 to 34,Male,model_8,Large/SUV,2,3,5,4,4,3,3,4,5,4,10
131185,153962,1,3. 35 to 44,Male,model_5,Large/SUV,2,2,8,2,7,5,6,1,6,4,10
131236,154021,1,3. 35 to 44,Male,model_4,Small/Medium,4,2,6,1,5,3,4,2,7,10,10


In [10]:
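# Split into features and target: drop the identifier, the target, and the
# categorical columns. Note that "Unnamed: 0" (typically a saved row index)
# is not dropped here, so it stays in X as a feature.
X = df.drop(columns=["ID", "Target", "age_band", "gender", "car_model", "car_segment"])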
y = df["Target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree classifier
clf = DecisionTreeClassifier()

# Train the classifier on the training set
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9880843611999391


You can tune the hyperparameters of the Decision Tree classifier further with grid search or random search. For example, to run a grid search over a set of candidate values, use the GridSearchCV class from scikit-learn:

In [11]:
from sklearn.model_selection import GridSearchCV

# Define the range of hyperparameters to search over
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6, 8],
    "min_samples_split": [2, 4, 6, 8],
    "min_samples_leaf": [1, 2, 4]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy"
)

# Perform the grid search on the training set
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and their mean cross-validated accuracy
print("Best hyperparameters:", grid_search.best_params_)
print("Accuracy:", grid_search.best_score_)

Best hyperparameters: {'criterion': 'entropy', 'max_depth': 8, 'min_samples_leaf': 1, 'min_samples_split': 4}
Accuracy: 0.9873511691364781


This runs a grid search over every combination in param_grid and returns the combination with the highest mean accuracy across the 5 cross-validation folds of the training set. Note that best_score_ is this cross-validated score, not accuracy on the held-out test set. You can then use the best hyperparameters to train a final model and evaluate its performance on the test set.
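
The final step described above could look roughly like the sketch below. It is self-contained, so it uses synthetic data from make_classification as a stand-in for repurchase_training.csv (an assumption made only so the example runs anywhere), and a smaller grid than the one above. It relies on GridSearchCV's default refit=True, which retrains the best configuration on the whole training set and exposes it as best_estimator_.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (assumption, for illustration only)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A reduced grid to keep the example quick
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 2, 4],
}

grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is already retrained
# on all of X_train using the best hyperparameters
final_model = grid_search.best_estimator_

# Evaluate once on the held-out test set, which the search never saw
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))

print("Best hyperparameters:", grid_search.best_params_)
print("Mean CV accuracy:", grid_search.best_score_)
print("Test accuracy:", test_accuracy)
```

With the real data, you would keep the earlier X_train/X_test split and grid_search object and run only the last few lines. RandomizedSearchCV exposes the same best_params_, best_score_, and best_estimator_ attributes, so the same pattern applies if you switch to random search.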