We're going to rely on scatter plots for some of our analysis, so let's install a package that will help us do meaningful dimension reduction.

In [1]:
!pip install --quiet umap-learn

Let's load our data; we will defer doing any feature engineering for the moment.

In [2]:
import pandas as pd

CARS = '/kaggle/input/car-evaluation-classification/cars.csv'
df = pd.read_csv(filepath_or_buffer=CARS)
TARGET = 'class'
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


Is our target class balanced?

In [3]:
df[TARGET].value_counts().to_frame().T

class,unacc,acc,good,vgood
count,1210,384,69,65


Most cars are unacceptable; we will probably have a hard time predicting good or very good cars reliably.
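To put a number on the imbalance, here is a minimal sketch that turns the counts printed above into class shares and the accuracy of a baseline that always predicts the most common class. The `counts` dictionary is copied by hand from the output above; the variable names are just for illustration.

```python
# Class counts copied from the value_counts() output above.
counts = {'unacc': 1210, 'acc': 384, 'good': 69, 'vgood': 65}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts the most common class scores this accuracy.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print('{:>6}: {:6.2%}'.format(label, share))
print('majority-class baseline accuracy: {:5.4f}'.format(baseline_accuracy))
```

Any accuracy we report later should be read against this baseline rather than against zero.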

Let's one-hot encode all six features as dummy variables so that every model input is numeric (boolean) rather than a string.

In [4]:
data_df = pd.get_dummies(data=df, columns=['doors', 'persons', 'buying', 'maint', 'lug_boot', 'safety'])
COLUMNS = ['doors_2', 'doors_3', 'doors_4', 'doors_5more', 'persons_2',
       'persons_4', 'persons_more', 'buying_high', 'buying_low', 'buying_med',
       'buying_vhigh', 'maint_high', 'maint_low', 'maint_med', 'maint_vhigh',
       'lug_boot_big', 'lug_boot_med', 'lug_boot_small', 'safety_high',
       'safety_low', 'safety_med']
data_df.head()

Unnamed: 0,class,doors_2,doors_3,doors_4,doors_5more,persons_2,persons_4,persons_more,buying_high,buying_low,...,maint_high,maint_low,maint_med,maint_vhigh,lug_boot_big,lug_boot_med,lug_boot_small,safety_high,safety_low,safety_med
0,unacc,True,False,False,False,True,False,False,False,False,...,False,False,False,True,False,False,True,False,True,False
1,unacc,True,False,False,False,True,False,False,False,False,...,False,False,False,True,False,False,True,False,False,True
2,unacc,True,False,False,False,True,False,False,False,False,...,False,False,False,True,False,False,True,True,False,False
3,unacc,True,False,False,False,True,False,False,False,False,...,False,False,False,True,False,True,False,False,True,False
4,unacc,True,False,False,False,True,False,False,False,False,...,False,False,False,True,False,True,False,False,False,True
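The hard-coded column list works, but it has to be kept in sync with the data by hand. As a hedged alternative, the dummy column names can be derived from the feature names. This is a sketch on a small made-up frame (`toy_df`, `FEATURES`, and `toy_columns` are hypothetical names, not part of the notebook above); filtering by prefix keeps the target and any leftover index column out of the model inputs.

```python
import pandas as pd

# Small made-up frame with the same kind of string-valued features.
toy_df = pd.DataFrame({
    'doors': ['2', '3', '5more'],
    'safety': ['low', 'med', 'high'],
    'class': ['unacc', 'unacc', 'acc'],
})
FEATURES = ['doors', 'safety']

toy_dummies = pd.get_dummies(data=toy_df, columns=FEATURES)

# Keep only the columns that get_dummies created from the listed features.
prefixes = tuple(feature + '_' for feature in FEATURES)
toy_columns = [column for column in toy_dummies.columns if column.startswith(prefixes)]
print(toy_columns)
```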


Now let's do dimension reduction to see if our data clusters naturally.

In [5]:
import arrow
from umap import UMAP

time_start = arrow.now()
reducer = UMAP(random_state=2024, verbose=False, n_jobs=1, low_memory=False, n_epochs=201)
data_df[['x', 'y']] = reducer.fit_transform(X=data_df[COLUMNS])
print('done with UMAP in {}'.format(arrow.now() - time_start))

done with UMAP in 0:00:15.677619


In [6]:
from plotly import express
express.scatter(data_frame=data_df, x='x', y='y', color=TARGET, height=600)

Boy does it ever; we get a lot of local clustering, and a lot of separation between our clusters. But the clusters don't generally correspond to our classes. Let's build a model and see the bad news.

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(data_df[COLUMNS], data_df[TARGET], test_size=0.2, random_state=2024, stratify=data_df[TARGET])

logreg = LogisticRegression(max_iter=100, tol=1e-12).fit(X_train, y_train)
print('model fit in {} iterations'.format(logreg.n_iter_[0]))
print('accuracy: {:5.4f} f1: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=logreg.predict(X=X_test)), f1_score(average='weighted', y_true=y_test, y_pred=logreg.predict(X=X_test), zero_division=0)))
print(classification_report(y_true=y_test, y_pred=logreg.predict(X=X_test), zero_division=0))

model fit in 86 iterations
accuracy: 0.9133 f1: 0.9062
              precision    recall  f1-score   support

         acc       0.83      0.83      0.83        77
        good       0.50      0.21      0.30        14
       unacc       0.96      0.98      0.97       242
       vgood       0.75      0.92      0.83        13

    accuracy                           0.91       346
   macro avg       0.76      0.74      0.73       346
weighted avg       0.90      0.91      0.91       346



Our accuracy and weighted F1 look good, but they are driven mostly by how well we predict unacceptable cars, which make up 242 of the 346 test rows. Recall on good cars is only 0.21.
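To see how much the majority class props up the headline number, this sketch recomputes the macro and weighted F1 averages from the per-class rows of the report above. The values are copied by hand from that report, so the results should match its macro avg and weighted avg rows only up to rounding.

```python
# Per-class (f1-score, support) pairs copied from the logistic regression report above.
report = {
    'acc': (0.83, 77),
    'good': (0.30, 14),
    'unacc': (0.97, 242),
    'vgood': (0.83, 13),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

unacc_share = report['unacc'][1] / total_support
print('macro f1: {:.2f} weighted f1: {:.2f}'.format(macro_f1, weighted_f1))
print('unacc share of the test set: {:.2%}'.format(unacc_share))
```

The gap between the two averages comes almost entirely from the weak score on good cars being outweighed by the large unacceptable class.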

We can do somewhat better with a neural net.

In [8]:
from sklearn.neural_network import MLPClassifier

neural = MLPClassifier(alpha=1, max_iter=1000, random_state=2024).fit(X=X_train, y=y_train)
# score the neural net's own predictions, using the same weighted F1 as for the logistic regression
print('accuracy: {:5.4f} f1: {:.2f}'.format(accuracy_score(y_true=y_test, y_pred=neural.predict(X=X_test)), f1_score(average='weighted', y_true=y_test, y_pred=neural.predict(X=X_test), zero_division=0)))
print(classification_report(y_true=y_test, y_pred=neural.predict(X=X_test), zero_division=0))

accuracy: 0.9740 f1: 0.97
              precision    recall  f1-score   support

         acc       0.95      0.97      0.96        77
        good       0.91      0.71      0.80        14
       unacc       1.00      1.00      1.00       242
       vgood       0.79      0.85      0.81        13

    accuracy                           0.97       346
   macro avg       0.91      0.88      0.89       346
weighted avg       0.97      0.97      0.97       346



Let's take a look at the model results through the lens of the model probabilities.

In [9]:
probability_df = pd.DataFrame(data=neural.predict_proba(X=X_test).max(axis=1), columns=['probability'])
probability_df['true'] = y_test.tolist()
probability_df['pred'] = neural.predict(X=X_test)
probability_df['correct'] = probability_df['true'] == probability_df['pred']
probability_df[['x', 'y']] = reducer.transform(X=X_test)

probability_df.head()

Unnamed: 0,probability,true,pred,correct,x,y
0,0.768072,acc,acc,True,3.460157,3.953948
1,0.999126,unacc,unacc,True,7.409603,7.377582
2,0.992469,unacc,unacc,True,3.625695,6.544508
3,0.999995,unacc,unacc,True,6.999966,5.743499
4,0.999791,unacc,unacc,True,5.101852,7.833376
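The `probability` column is the largest entry in each row of `predict_proba`, and the predicted label is the class sitting at that position. A minimal sketch of that relationship, using made-up probability rows and assuming the columns follow the alphabetical class order seen in the reports above:

```python
# Class order assumed to match the classifier's classes_ attribute (alphabetical here).
classes = ['acc', 'good', 'unacc', 'vgood']

# Made-up predict_proba rows, one per car; each row sums to 1.
proba_rows = [
    [0.70, 0.20, 0.05, 0.05],
    [0.01, 0.00, 0.99, 0.00],
    [0.30, 0.45, 0.05, 0.20],
]

predictions = []
confidences = []
for row in proba_rows:
    # Index of the largest probability in this row.
    top = max(range(len(row)), key=row.__getitem__)
    predictions.append(classes[top])
    confidences.append(row[top])

print(list(zip(predictions, confidences)))
```

A row like the third one, where the winning class gets less than half of the probability mass, is the kind of low-confidence prediction examined next.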


What does our mean model probability look like conditioned on whether the prediction is correct?

In [10]:
probability_df[['probability', 'correct']].groupby(by=['correct']).mean().to_dict()

{'probability': {False: 0.5647966183387692, True: 0.9153586805266136}}

In other words, when our model is correct it usually assigns a high probability to its prediction (0.92 on average); when it is incorrect the probability is much lower (0.56 on average).

In [11]:
express.histogram(data_frame=probability_df, x='probability', facet_col='correct')

We don't have a lot of incorrect predictions (9 of 346, given the 0.9740 accuracy), but let's look at our predictions in terms of the scatter plot we made above, colored by the model probabilities.

In [12]:
express.scatter(data_frame=probability_df, x='x', y='y', color='probability', facet_col='correct', hover_name='true')

Our incorrect predictions are mostly acceptable/good cars that UMAP projects near unacceptable cars, which suggests they are marginal cars. Not surprisingly, the correct predictions with low model probabilities are also mostly good/acceptable cars.