## Simple Car Safety Analysis and Prediction

This notebook predicts car evaluation labels from the UCI Car Evaluation dataset using basic logistic regression.

In [87]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics

#features
#buying: vhigh, high, med, low.
#maint: vhigh, high, med, low.
#doors: 2, 3, 4, 5more.
#persons: 2, 4, more.
#lug_boot: small, med, big.
#safety:low, med, high

#target class values
#label: unacc, acc, good, vgood

df = pd.read_csv("car.data", names=["buying", "maint", 'doors', "persons", "lug_boot", "safety", "label"])

The labels are noticeably imbalanced (`unacc` accounts for 1210 of the 1728 rows), but the imbalance doesn't seem severe enough to warrant a resampling technique such as SMOTE.

In [67]:
df['label'].describe()

count      1728
unique        4
top       unacc
freq       1210
Name: label, dtype: object
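
As a quick check on how strong the imbalance is, the majority-class share can be worked out from the counts in the output above. This is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Counts copied from the describe() output above.
n_rows = 1728        # total observations
n_majority = 1210    # rows labelled "unacc"

# Share of the most frequent class. A classifier that always predicts
# "unacc" would reach roughly this accuracy on the full dataset.
majority_share = n_majority / n_rows
print(f"Majority-class share: {majority_share:.3f}")
```

Any accuracy reported for a model on this data is worth reading against this baseline of about 0.70.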

**Source**:
https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

There are no null or NA values.

In [45]:
df.isna().values.any()

False

In [46]:
df.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
label       0
dtype: int64

In [47]:
df.head()

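|   | buying | maint | doors | persons | lug_boot | safety | label |
|---|--------|-------|-------|---------|----------|--------|-------|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |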


In [48]:
df.shape

(1728, 7)

`df[["doors", "persons"]] = df[["doors", "persons"]].astype(int)`

Converting the `doors` and `persons` columns to integers this way raises an error, because some values (`5more`, `more`) are not valid integers, as the following cell shows. We can either map those values to integers by hand or encode the columns as categories.

In [49]:
print(df["doors"].unique())
print(df["persons"].unique())

['2' '3' '4' '5more']
['2' '4' 'more']
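
The first option, mapping the values to integers by hand, could look roughly like the sketch below. It runs on a small toy frame rather than the real data, and the numbers chosen for `5more` and `more` are assumptions made for illustration, not values defined by the dataset.

```python
import pandas as pd

# Toy frame with the same string values as the real columns.
toy = pd.DataFrame({
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more", "4"],
})

# Assumption: "5more" is treated as 5 and "more" as 5; other choices are possible.
doors_map = {"2": 2, "3": 3, "4": 4, "5more": 5}
persons_map = {"2": 2, "4": 4, "more": 5}

# Series.map replaces each value by its dictionary entry.
toy["doors"] = toy["doors"].map(doors_map)
toy["persons"] = toy["persons"].map(persons_map)
print(toy)
```

The notebook itself takes the second option and encodes every column as a category.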


In [50]:
df = df.astype(str)

In [89]:
X = df.drop(["label"], axis = 1)
y = df.label

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2)

In [90]:
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

label_encoder = LabelEncoder()
for col in list(X.columns):
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])



In [53]:
X_train.head()

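|      | buying | maint | doors | persons | lug_boot | safety |
|------|--------|-------|-------|---------|----------|--------|
| 1549 | low | med | 3 | 4 | small | med |
| 935 | med | vhigh | 4 | 4 | big | high |
| 824 | high | low | 4 | 4 | med | high |
| 323 | vhigh | med | 5more | more | big | high |
| 1592 | low | med | 4 | more | big | high |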


In [54]:
label_X_train.head()

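|      | buying | maint | doors | persons | lug_boot | safety |
|------|--------|-------|-------|---------|----------|--------|
| 1549 | 1 | 2 | 1 | 1 | 2 | 2 |
| 935 | 2 | 3 | 2 | 1 | 0 | 0 |
| 824 | 0 | 1 | 2 | 1 | 1 | 0 |
| 323 | 3 | 2 | 3 | 2 | 0 | 0 |
| 1592 | 1 | 2 | 2 | 2 | 0 | 0 |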
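
Note that `LabelEncoder` assigns codes in alphabetical order, so in the output above `high` becomes 0, `low` becomes 1, `med` becomes 2 and `vhigh` becomes 3, which does not follow the natural order of the categories. One possible alternative, sketched here on a toy frame with only two of the columns, is scikit-learn's `OrdinalEncoder` with an explicit category order. This is not what the notebook does, and the toy data is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame using category names from the feature list at the top.
toy = pd.DataFrame({
    "buying": ["low", "med", "high", "vhigh"],
    "safety": ["med", "high", "low", "high"],
})

# One ordered list per column, lowest category first.
encoder = OrdinalEncoder(categories=[
    ["low", "med", "high", "vhigh"],  # buying
    ["low", "med", "high"],           # safety
])

# fit_transform returns a float array whose codes follow the given order.
encoded = encoder.fit_transform(toy)
print(encoded)
```

With an order like this, a larger code always means a higher category, which is easier for a linear model to use than alphabetical codes.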


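`LogisticRegression` applies L2 regularization by default. Its strength is controlled by `C` (smaller values regularize more); a very large `C` effectively turns regularization off, and recent scikit-learn versions also accept an explicit no-penalty setting.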

In [78]:
logreg = LogisticRegression()
logreg.fit(label_X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [79]:
y_pred = logreg.predict(label_X_valid)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(label_X_valid, y_valid)))

Accuracy of logistic regression classifier on test set: 0.71
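
To see what the regularization strength does, the following sketch fits the same model with a small and a very large `C` on synthetic data and compares the size of the coefficients. The data from `make_classification` and the two `C` values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, unrelated to the car data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# Small C means strong L2 regularization; a huge C means almost none.
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X_demo, y_demo)
weak = LogisticRegression(C=1e6, max_iter=1000).fit(X_demo, y_demo)

# Stronger regularization shrinks the coefficient vector.
strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
print(strong_norm, weak_norm)
```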


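Switching to an L1 (lasso) penalty gives no improvement. Note that in newer scikit-learn versions the default solver does not support L1, so a solver such as `liblinear` or `saga` has to be passed explicitly.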

In [80]:
logreg = LogisticRegression(penalty = 'l1')
logreg.fit(label_X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [81]:
y_pred = logreg.predict(label_X_valid)
print('Accuracy of regularized logistic regression classifier on test set: {:.2f}'.format(logreg.score(label_X_valid, y_valid)))

Accuracy of regularized logistic regression classifier on test set: 0.71


Because the dataset is small, cross-validation is cheap, so we also try `LogisticRegressionCV`, which selects the regularization strength `C` by 5-fold cross-validation on the training set. It scores lower here (0.67 versus 0.71), although tuning `C` this way may help on larger datasets by reducing overfitting.

In [95]:
clf = LogisticRegressionCV(cv=5).fit(label_X_train, y_train)
clf.predict(label_X_valid)
clf.score(label_X_valid, y_valid)



0.6705202312138728
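
`LogisticRegressionCV` uses the folds to pick `C`, so the score above still comes from a single random split. A different use of cross-validation is to average the accuracy of a fixed model over several folds, which tends to give a steadier estimate. A minimal sketch with `cross_val_score` on synthetic three-class data (the data and settings are assumptions, not the car dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic three-class problem for illustration only.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# One accuracy value per fold; the mean and spread summarize them.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```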

Unfortunately, ROC AUC scores for multiclass targets are not supported by the scikit-learn version used here.
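
In newer scikit-learn releases, `roc_auc_score` accepts a `multi_class` argument and works on predicted class probabilities. The sketch below assumes such a version and uses synthetic three-class data rather than the car data; the one-vs-rest setting is one of the available choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem for illustration only.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Stratify so that every class appears in the held-out part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Multiclass ROC AUC needs one probability column per class.
proba = model.predict_proba(X_te)
auc = roc_auc_score(y_te, proba, multi_class="ovr")
print(auc)
```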