## Overview

The goal of this project is to predict car evaluations based on the characteristics provided in the car dataset in UCI Machine Learning repository. This dataset has 1728 records, each record representing a car evaluation. Each car evaluation is described with 7 attributes. 6 of the attributes represent car characteristics, such as buying price, price of the maintenance, number of doors, capacity in terms of persons to carry, the size of luggage boot, and estimated safety of the car. The seventh variable represents the evaluation of the car (unacceptable, acceptable, good, very good).

We will be using 5 different classification algorithms shown below - 

1. K-Nearest Neighbors
2. Decision Trees
3. Naive Bayes
4. Logistic Regression
5. Support Vector Machines

## Importing Packages

We will use Sklearn, numpy and pandas to produce model results and use matplotlib for visualizations of our model results.

In [1]:
%matplotlib inline
from sklearn import tree, linear_model, neighbors, svm
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, KFold
from sklearn.metrics import confusion_matrix, classification_report, f1_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import warnings
warnings.filterwarnings("ignore")

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

# path = "/content/drive/MyDrive/Data/car.csv"
# car = pd.read_csv(path)

## Reading files and cleaning

I read in the csv file and assign names to each column. I also look at some basic information about the data using the describe and info methods.

In [3]:
car = pd.read_csv('car.csv', header = None)
car.columns = ['buying','maint','doors','persons','lug_boot','safety','evaluation']
car.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,evaluation
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [4]:
## Looking at a summary of the columns
car.describe()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,evaluation
count,1728,1728,1728,1728,1728,1728,1728
unique,4,4,4,3,3,3,4
top,vhigh,vhigh,2,2,small,low,unacc
freq,432,432,432,576,576,576,1210


In [5]:
## Checking for nulls
print(car.info())
print(car.evaluation.value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   buying      1728 non-null   object
 1   maint       1728 non-null   object
 2   doors       1728 non-null   object
 3   persons     1728 non-null   object
 4   lug_boot    1728 non-null   object
 5   safety      1728 non-null   object
 6   evaluation  1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB
None
unacc    1210
acc       384
good       69
vgood      65
Name: evaluation, dtype: int64


## Dealing with categorical variables

All the features that we have for prediction are categorical variables. My next step is to encode the variables in a numerical form using the One Hot Encoding method. Every unique value of a feature will become a new column, resulting in 21 features totally. 

I've also encoded the target variable into a numeric form using the following mapping - 

1. unacc = 0
2. acc = 1
3. good = 2
4. vgood = 3

In [6]:
## Creating one-hot encoding for the categorical variables
X = car.iloc[:,0:6]
y = np.where(car['evaluation'] == 'unacc',0,
             np.where(car['evaluation'] == 'acc',
                      1,np.where(car['evaluation'] == 'good',2,3)))

encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)

X_one_hot = encoder.fit_transform(X)
X_one_hot

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 1., 0., 0.]])

## Split into train, test and validation

To avoid overfitting, we will split our data into 2 parts - train and test. 

We will use the validation set to select the best model from all the different hyperparameters and then finally compare the results of all 3 different models on the testing dataset to decide which model works the best.

The split that I have chosen - 

- Train = 70% 
- Test = 30%

In [7]:
# Split into training, validation and testing set
#Train, test, validation split
random.seed(456)

X_train, X_rem, y_train, y_rem = train_test_split(X_one_hot,y, 
                                                  train_size=0.7, 
                                                  stratify = y)

X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, 
                                                    train_size=0.5, 
                                                    stratify = y_rem)

## Model Selection

We will use nested cross-validation to select the best model out of the 5 choices we have - KNN, Decision Tree, Logistic Regression, Naive Bayes and Support Vector Machine.

We follow the steps below for each model - 

1. We select the list of parameters we want to optimize over for each model and put it into a dictionary 
2. Set up the inner cross-validation object by using this parameter grid
3. Set up the outer cross-validation object.
5. Create a grid search object using the inner cv object. This will be used to find the best parameter for each outer fold.
4. Get the cross validation score using the grid search object as the estimator and the outer_cv as the cross validation parameter

In [8]:
### Create model objects

knn = make_pipeline(MinMaxScaler(), 
                    neighbors.KNeighborsClassifier(metric = 'hamming'))
dt = tree.DecisionTreeClassifier()
logit = make_pipeline(MinMaxScaler(), 
                      linear_model.LogisticRegression(max_iter = 10000, solver = 'saga'))
svm_classifier = make_pipeline(MinMaxScaler(), svm.SVC())
nb_classifier = MultinomialNB()


###### Create Parameter List

## KNN
k_range = list(range(1,10))
knn_params = dict(kneighborsclassifier__n_neighbors = k_range, 
                  kneighborsclassifier__weights = ['uniform','distance'], 
                  kneighborsclassifier__p = [1,2,3])

## Decision Tree
depth_range = list(range(1,10))
min_samples_range = list(range(2,10))
impurity_decrease_range = list(np.linspace(0.1,0.5,5))

dt_params = dict(criterion = ['gini','entropy'], 
                 splitter = ['best','random'],
                 max_depth = depth_range, 
                 max_features = ['auto','sqrt','log2',None], #
                 random_state = [456],
                 min_impurity_decrease = impurity_decrease_range,
                 class_weight = ['balanced',None])

## Logistic Regression

c_range = [0.1, 1, 10]
l1r_range = [0,0.5,1]

logit_params = dict(logisticregression__penalty = ['elasticnet'],
                    logisticregression__random_state = [456],
                    logisticregression__C = c_range,
                    logisticregression__class_weight = ['balanced', None],
                    logisticregression__l1_ratio = l1r_range)

## Support Vector Machine

svm_params = {"svc__C": c_range,
              "svc__gamma": [0.1, 1, 10],
              "svc__degree": list(range(1,5)),
              "svc__kernel": ['rbf'],
              "svc__class_weight" : ['balanced',None],
              "svc__decision_function_shape" : ['ovo','ovr'],
              "svc__break_ties" : [False],
              "svc__random_state" : [456]
            }

## Naive Bayes

nb_params = dict(alpha = list(np.linspace(0,1,11)),fit_prior = [True,False])


In [9]:
##### Set up inner and outerCV loops

inner_cv = KFold(n_splits = 5, shuffle = True, random_state = 456)
outer_cv = KFold(n_splits = 5, shuffle = True, random_state = 456)

### Use metric

scoring = 'f1_weighted'


### Create Grid Search estimators

knn_gs = GridSearchCV(estimator = knn, param_grid = knn_params, 
                      scoring = scoring, cv = inner_cv)
dt_gs = GridSearchCV(estimator = dt, param_grid = dt_params, 
                     scoring = scoring, cv = inner_cv)
logit_gs = GridSearchCV(estimator = logit, param_grid = logit_params, 
                        scoring = scoring, cv = inner_cv)
svm_gs = GridSearchCV(estimator = svm_classifier, param_grid = svm_params, 
                      scoring = scoring, cv = inner_cv)
nb_gs = GridSearchCV(estimator = nb_classifier, param_grid = nb_params, 
                     scoring = scoring, cv = inner_cv)

### performing Nested cross validation using the inner and outer loops
dt_score = cross_val_score(estimator = dt_gs, X = X_train, y = y_train, 
                           cv = outer_cv, scoring = scoring)
knn_score = cross_val_score(estimator = knn_gs, X = X_train, y = y_train, 
                            cv = outer_cv, scoring = scoring)
logit_score = cross_val_score(estimator = logit_gs, X = X_train, y = y_train, 
                              cv = outer_cv, scoring = scoring)
svm_score = cross_val_score(estimator = svm_gs, X = X_train, y = y_train, 
                            cv = outer_cv, scoring = scoring)
nb_score = cross_val_score(estimator = nb_gs, X = X_train, y = y_train, 
                           cv = outer_cv, scoring = scoring)

In [10]:
print("knn = " + str(np.round(knn_score.mean(),2)) + " " + u"\u00B1" + " " 
      + str(np.round(knn_score.std(),2)))
print("dt = " + str(np.round(dt_score.mean(),2)) + " " + u"\u00B1" + " " 
      + str(np.round(dt_score.std(),2)))
print("logit = " + str(np.round(logit_score.mean(),2)) + " " + u"\u00B1" + " "
      + str(np.round(logit_score.std(),2)))
print("svm = " + str(np.round(svm_score.mean(),2)) + " " + u"\u00B1" + " "
      + str(np.round(svm_score.std(),2)))
print("nb = " + str(np.round(nb_score.mean(),2)) + " " + u"\u00B1" + " "
      + str(np.round(nb_score.std(),2)))

knn = 0.88 ± 0.01
dt = 0.77 ± 0.02
logit = 0.93 ± 0.01
svm = 0.99 ± 0.01
nb = 0.83 ± 0.01


## Best model

It looks like the Support Vector Machine does the best in terms of the f1-score. We will use the Support Vector Machine to create our final model.

In [11]:
svm_score

array([0.97530421, 0.99580341, 1.        , 0.99195894, 0.98065177])

## Re-training and evaluation

Now we re-train our SVM model on the entire training dataset and evaluate it on the testing set. 

Looking at the classification report below, here are some observations - 

1. The overall accuracy and f1-score is 0.99, which shows that our model performs generalizes well on the testing dataset.
2. In most classes, the model performs well. The 'good' class does not perform as well as other classes with an f1-score of 0.95 and a recall of 0.91.
3. Overall, the model does a fairly good job of predicting car evaluations decently well


In [12]:
## Looks like SVM performs the best

## Using SVM to train our data 
clf_svm = svm_gs.fit(X_train, y_train)

#Predict on test data
y_pred = clf_svm.predict(X_test)

print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       180
           1       1.00      0.98      0.99        59
           2       1.00      0.91      0.95        11
           3       1.00      1.00      1.00        10

    accuracy                           0.99       260
   macro avg       1.00      0.97      0.98       260
weighted avg       0.99      0.99      0.99       260



## Treating the data as numeric

What if we hadn't coded the data using OneHotEncoder and treated the variables as numeric variables instead. We code this in manually.

Once that is done, we will repeat the steps of nested cross validation for model selection and then re-train our best model to evaluate on the test data and compare which type of model is the best. 

In [20]:
## Creating a label encoding for the categorical variables
from sklearn.preprocessing import LabelEncoder

X = car.iloc[:,0:6]
y = np.where(car['evaluation'] == 'unacc',0,np.where(car['evaluation'] == 'acc',1,np.where(car['evaluation'] == 'good',2,3)))


# https://sparkbyexamples.com/pandas/pandas-remap-values-in-column-with-a-dictionary-dict/#:~:text=Using%20Pandas%20DataFrame.-,replace(),regular%20expressions%20for%20regex%20substitutions.

X['buying'].replace({'vhigh':3, 'high':2, 'med':1, 'low':0}, inplace=True)
X['maint'].replace({'vhigh':3, 'high':2, 'med':1, 'low':0}, inplace=True)
X['doors'].replace({'5more':3,'4':2, '3':1, '2':0}, inplace=True)
X['persons'].replace({'more':2, '4':1, '2':0}, inplace=True)
X['lug_boot'].replace({'big':2, 'med':1, 'small':0}, inplace=True)
X['safety'].replace({'high':2, 'med':1, 'low':0}, inplace=True)
X.head()

## Split into train and test_Sets

# Split into training, validation and testing set
#Train, test, validation split
random.seed(456)

X_train_l, X_rem_l, y_train_l, y_rem_l = train_test_split(X_label,y, train_size=0.7, stratify = y)

X_valid_l, X_test_l, y_valid_l, y_test_l = train_test_split(X_rem_l,y_rem_l, train_size=0.5, stratify = y_rem)

In [21]:
X_train_l.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
33,3,3,1,0,0,1
1221,2,1,1,0,0,1
850,0,1,3,1,1,2
1672,1,1,1,2,0,2
1470,1,0,2,1,1,1


In [23]:
### Create model objects

knn = make_pipeline(MinMaxScaler(), neighbors.KNeighborsClassifier())
dt = tree.DecisionTreeClassifier()
logit = make_pipeline(MinMaxScaler(), linear_model.LogisticRegression(max_iter = 10000, solver = 'saga'))
svm_classifier = make_pipeline(MinMaxScaler(), svm.SVC())
nb_classifier = MultinomialNB()

##### Set up inner and outerCV loops

inner_cv = KFold(n_splits = 5, shuffle = True, random_state = 456)
outer_cv = KFold(n_splits = 5, shuffle = True, random_state = 456)

### Use metric

scoring = 'f1_weighted'


### Create Grid Search estimators

knn_gs = GridSearchCV(estimator = knn, param_grid = knn_params, 
                      scoring = scoring, cv = inner_cv)
dt_gs = GridSearchCV(estimator = dt, param_grid = dt_params, 
                     scoring = scoring, cv = inner_cv)
logit_gs = GridSearchCV(estimator = logit, param_grid = logit_params, 
                        scoring = scoring, cv = inner_cv)
svm_gs = GridSearchCV(estimator = svm_classifier, param_grid = svm_params, 
                      scoring = scoring, cv = inner_cv)
nb_gs = GridSearchCV(estimator = nb_classifier, param_grid = nb_params, 
                     scoring = scoring, cv = inner_cv)

### performing Nested cross validation using the inner and outer loops
dt_score = cross_val_score(estimator = dt_gs, X = X_train_l, y = y_train_l, 
                           cv = outer_cv, scoring = scoring)
knn_score = cross_val_score(estimator = knn_gs, X = X_train_l, y = y_train_l, 
                            cv = outer_cv, scoring = scoring)
logit_score = cross_val_score(estimator = logit_gs, X = X_train_l, y = y_train_l, 
                              cv = outer_cv, scoring = scoring)
svm_score = cross_val_score(estimator = svm_gs, X = X_train_l, y = y_train_l, 
                            cv = outer_cv, scoring = scoring)
nb_score = cross_val_score(estimator = nb_gs, X = X_train_l, y = y_train_l, 
                           cv = outer_cv, scoring = scoring)

print("knn = " + str(np.round(knn_score.mean(),2)) + " " + u"\u00B1" + " " 
      + str(np.round(knn_score.std(),2)))
print("dt = " + str(np.round(dt_score.mean(),2)) + " " + u"\u00B1" + " " 
      + str(np.round(dt_score.std(),2)))
print("logit = " + str(np.round(logit_score.mean(),2)) + " " + u"\u00B1" + " "
      + str(np.round(logit_score.std(),2)))
print("svm = " + str(np.round(svm_score.mean(),2)) + " " + u"\u00B1" + " "
      + str(np.round(svm_score.std(),2)))
print("nb = " + str(np.round(nb_score.mean(),2)) + " " + u"\u00B1" + " "
      + str(np.round(nb_score.std(),2)))

knn = 0.88 ± 0.02
dt = 0.77 ± 0.02
logit = 0.64 ± 0.04
svm = 0.96 ± 0.02
nb = 0.59 ± 0.02


## Retraining and evaluating on the test set

The model that does the best is still the SVM. However, it seems like the overall model accuracies have taken a hit.


Let's retrain the model on the testing set to see the test accuracies.

In [25]:
## Looks like SVM performs the best

## Using SVM to train our data 
clf_svm_l = svm_gs.fit(X_train_l, y_train_l)

#Predict on test data
y_pred = clf_svm_l.predict(X_test_l)

print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.71      0.71      0.71       183
           1       0.26      0.26      0.26        57
           2       0.00      0.00      0.00         9
           3       0.10      0.09      0.10        11

    accuracy                           0.56       260
   macro avg       0.27      0.27      0.27       260
weighted avg       0.56      0.56      0.56       260



## Conclusions and comparisons with oneHot transformation

Looking at the results above, it look like SVM is still the best model using this kind of transformation. However, the model accuracies/f1-scores have taken a hit. 

The new model has a very low weighted f-1 score of 0.56 vs the onehot model of 0.99. Using a one-hot encoding is definitely better than using a regular numeric coding in this particular case. Some of the more general pros and cons of using this numerical coding are listed below -

Pros - 
1. The number of features are much less. This will improve the model training process by being more computationally efficient. 
2. If the data has an inherent order, it might be more useful to use and the difference between each level is the same, we can use a numerical encoding method.

Cons - 
1. If the data is of nominal type, it does not make sense to encode it in a numerical format. This is because the difference between each level might not be the same and the model will estimate its paramenters assuming that the data is interval level.
