# 1. Import libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
# import seaborn as sns # statistical data visualization
%matplotlib inline

In [None]:
import warnings

warnings.filterwarnings('ignore')

# 2. Import dataset

In [None]:
data = 'car_evaluation.csv'

df = pd.read_csv(data, header=None)

# 3. Exploratory data analysis

Let's explore the data to gain some insight into it.

In [None]:
# view dimensions of dataset

df.shape

We can see that there are 1728 instances and 7 variables in the data set.

### View top 5 rows of dataset

In [None]:
# preview the dataset

df.head()

### Rename column names

We can see that the dataset does not have proper column names. The columns are merely labelled 0, 1, 2, and so on. We should give the columns meaningful names. I will do it as follows:

In [None]:
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']


df.columns = col_names

col_names

In [None]:
# let's again preview the dataset

df.head()

We can see that the columns have been renamed and now have meaningful names.
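As an alternative to renaming after loading, the names can be supplied when the file is read. The following is a minimal sketch, assuming a header-less, comma-separated layout like the one above; the three rows in `sample_csv` are hypothetical and stand in for the real file.

```python
import io

import pandas as pd

# Three hypothetical rows in the same header-less, comma-separated layout
sample_csv = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,med,4,4,big,high,vgood\n"
    "med,low,5more,more,med,med,acc\n"
)

col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# header=None keeps the first row as data; names= labels the columns at load time
sample_df = pd.read_csv(sample_csv, header=None, names=col_names)

print(sample_df)
```

Either approach gives the same column labels; passing `names` simply avoids the intermediate frame with integer labels.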

### View summary of dataset

In [None]:
df.info()

### Frequency distribution of values in variables

Now, I will check the frequency counts of categorical variables.

In [None]:
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']


for col in col_names:
    print(df[col].value_counts())


We can see that `doors` and `persons` also take only a small set of discrete values. So, I will treat them as categorical variables.

### Summary of variables


- There are 7 variables in the dataset. All the variables are of categorical data type.


- These are given by `buying`, `maint`, `doors`, `persons`, `lug_boot`, `safety` and `class`.


- `class` is the target variable.

### Explore `class` variable

In [None]:
df['class'].value_counts()

The `class` target variable is ordinal in nature.
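The ordering can be made explicit with an ordered pandas categorical. This is a small sketch on a hypothetical handful of labels; the level names and their order (`unacc` < `acc` < `good` < `vgood`) are an assumption about how the classes rank and should be checked against the output of `value_counts()` above.

```python
import pandas as pd

# Hypothetical handful of labels; level names assumed to match the `class` column
labels = pd.Series(['unacc', 'acc', 'vgood', 'good', 'unacc'])

# Assumed ordering, from least to most acceptable
class_order = ['unacc', 'acc', 'good', 'vgood']
ordered = pd.Series(pd.Categorical(labels, categories=class_order, ordered=True))

# Integer codes follow the declared order
print(ordered.cat.codes.tolist())   # [0, 1, 3, 2, 0]

# Comparisons such as min and max respect the declared order
print(ordered.min(), ordered.max()) # unacc vgood
```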

### Missing values in variables

In [None]:
# check missing values in variables

df.isnull().sum()

We can see that there are no missing values in the dataset. I have checked the frequency distribution of values previously. It also confirms that there are no missing values in the dataset.

# 4. Declare feature vector and target variable

In [None]:
X = df.drop(['class'], axis=1)

y = df['class']

# 5. Split data into separate training and test set

In [None]:
# split X and y into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42, stratify=y)


In [None]:
# check the shape of X_train and X_test

X_train.shape, X_test.shape
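The `stratify=y` argument used in the split keeps the class proportions roughly equal in the training and test sets. The sketch below illustrates this on synthetic labels; the class counts and the single `feature` column are made up for the illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels (hypothetical proportions)
y_demo = pd.Series(['unacc'] * 700 + ['acc'] * 220 + ['good'] * 40 + ['vgood'] * 40)
X_demo = pd.DataFrame({'feature': np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# Compare class shares in the full data, the training set and the test set
proportions = pd.DataFrame({
    'full': y_demo.value_counts(normalize=True),
    'train': y_tr.value_counts(normalize=True),
    'test': y_te.value_counts(normalize=True),
}).round(3)

print(proportions)
```

Without stratification, a rare class can end up under-represented in one of the two sets purely by chance.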

# 6. Feature Engineering

Feature engineering is the process of transforming raw data into features that a model can use effectively, with the aim of increasing its predictive power. I will carry out feature engineering on different types of variables.


First, I will check the data types of variables again.

In [None]:
# check data types in X_train

X_train.dtypes

### Encode categorical variables


Now, I will encode the categorical variables.

In [None]:
X_train.head()

We can see that all the variables are of ordinal categorical data type.

In [None]:
# import category encoders

import category_encoders as ce

In [None]:
# encode variables with ordinal encoding

encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])


X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [None]:
X_train.head()

In [None]:
X_test.head()

We now have the training and test sets ready for model building.
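One point to check on the encoding step: when no explicit mapping is given, the integer codes do not necessarily follow the natural order of the levels. The sketch below shows one way to fix the order explicitly. It swaps in scikit-learn's `OrdinalEncoder` rather than `category_encoders` (which also accepts a `mapping` argument for this purpose), and the rows and level orders are assumptions made for the illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows using two of the feature columns
demo = pd.DataFrame({
    'buying': ['vhigh', 'low', 'med', 'high'],
    'safety': ['low', 'high', 'med', 'high'],
})

# Assumed natural order of each feature's levels, lowest first
level_order = [
    ['low', 'med', 'high', 'vhigh'],  # buying
    ['low', 'med', 'high'],           # safety
]

ordinal = OrdinalEncoder(categories=level_order)
encoded = pd.DataFrame(ordinal.fit_transform(demo), columns=demo.columns)

# Codes are the positions of each level in level_order (returned as floats)
print(encoded)
```

For a decision tree the exact codes matter less than for a linear model, but an order-preserving encoding lets a single threshold split separate "low" levels from "high" ones.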

# 7. Decision Tree Classifier with criterion gini index

In [None]:
# import DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier


In [None]:
# instantiate the DecisionTreeClassifier model with criterion gini index

clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)


# fit the model
clf_gini.fit(X_train, y_train)
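For reference, the Gini impurity of a node is `1 - sum(p_k ** 2)`, where `p_k` is the share of class `k` among the samples in the node. The helper below is an illustrative sketch (the function name `gini_impurity` and the label lists are hypothetical, not part of scikit-learn's API).

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

pure = ['acc'] * 4                        # a single class: no impurity
mixed = ['acc', 'acc', 'unacc', 'unacc']  # two classes, evenly split

print(gini_impurity(pure))   # 0.0
print(gini_impurity(mixed))  # 0.5
```

The tree chooses splits that reduce the weighted impurity of the child nodes as much as possible.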


### Predict the Test set results with criterion gini index

In [None]:
y_pred_gini = clf_gini.predict(X_test)


### Check accuracy score with criterion gini index

In [None]:
from sklearn.metrics import accuracy_score

print('Model accuracy score with criterion gini index: {0:0.4f}'.format(accuracy_score(y_test, y_pred_gini)))

Here, y_test are the true class labels and y_pred_gini are the predicted class labels in the test-set.
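Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this against `accuracy_score` on a few hypothetical labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels
y_true_demo = np.array(['unacc', 'acc', 'unacc', 'good', 'unacc'])
y_pred_demo = np.array(['unacc', 'unacc', 'unacc', 'good', 'acc'])

# Fraction of positions where prediction and truth agree (3 of 5 here)
manual_accuracy = np.mean(y_true_demo == y_pred_demo)

print(manual_accuracy)
print(accuracy_score(y_true_demo, y_pred_demo))
```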

### Compare the train-set and test-set accuracy


Now, I will compare the train-set and test-set accuracy to check for overfitting.

In [None]:
y_pred_train_gini = clf_gini.predict(X_train)

y_pred_train_gini

In [None]:
print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train_gini)))

### Check for overfitting and underfitting

In [None]:
# print the scores on training and test set

print('Training set score: {:.4f}'.format(clf_gini.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(clf_gini.score(X_test, y_test)))

Here, the training-set accuracy score is 0.7865, while the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting.
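To see what overfitting would look like in this comparison, the sketch below fits a shallow and an unrestricted tree on synthetic data. The data come from `make_classification` with made-up settings, so the scores it prints say nothing about the car evaluation dataset; only the pattern is of interest.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (all settings are illustrative)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42
)

scores = {}
for depth in (3, None):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print('max_depth={}: train={:.4f}, test={:.4f}'.format(depth, *scores[depth]))
```

A training score far above the test score, as the unrestricted tree typically shows here, is the usual sign of overfitting; a small gap suggests the model generalizes about as well as it fits.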


### Visualize decision-trees

In [None]:
from sklearn import tree

plt.figure(figsize=(12,8))

# plot the tree fitted above; there is no need to call fit() again
tree.plot_tree(clf_gini)
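A tree can also be printed as text rules with `export_text`, which is handy when the plot is hard to read. The following is a minimal sketch on a tiny, hypothetical, already-encoded dataset; the values in `X_small` and `y_small` are invented for the illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny hypothetical training set with integer-encoded features
X_small = pd.DataFrame({
    'safety':  [1, 1, 2, 2, 3, 3],
    'persons': [1, 2, 1, 2, 1, 2],
})
y_small = ['unacc', 'unacc', 'unacc', 'acc', 'unacc', 'acc']

small_tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
small_tree.fit(X_small, y_small)

# feature_names makes the rules refer to column names instead of feature indices
rules = export_text(small_tree, feature_names=list(X_small.columns))
print(rules)
```

`tree.plot_tree` accepts `feature_names`, `class_names` and `filled=True` in the same spirit, which makes the nodes of the plotted tree easier to interpret.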

# 8. Decision Tree Classifier with criterion entropy

In [None]:
# instantiate the DecisionTreeClassifier model with criterion entropy

clf_en = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)


# fit the model
clf_en.fit(X_train, y_train)
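With `criterion='entropy'`, node impurity is measured as `-sum(p_k * log2(p_k))`, where `p_k` is the share of class `k` in the node. The helper below is an illustrative sketch (the name `entropy` and the label lists are hypothetical).

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

two_classes = ['acc', 'acc', 'unacc', 'unacc']     # evenly split, 2 classes
four_classes = ['unacc', 'acc', 'good', 'vgood']   # evenly split, 4 classes

print(entropy(two_classes))   # 1.0
print(entropy(four_classes))  # 2.0
```

Entropy is zero for a pure node and largest when the classes are evenly mixed, so it ranks splits in much the same way as the Gini index.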

### Predict the Test set results with criterion entropy

In [None]:
y_pred_en = clf_en.predict(X_test)

### Check accuracy score with criterion entropy

In [None]:
from sklearn.metrics import accuracy_score

print('Model accuracy score with criterion entropy: {0:0.4f}'.format(accuracy_score(y_test, y_pred_en)))

### Compare the train-set and test-set accuracy


Now, I will compare the train-set and test-set accuracy to check for overfitting.

In [None]:
y_pred_train_en = clf_en.predict(X_train)

y_pred_train_en

In [None]:
print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train_en)))

### Check for overfitting and underfitting

In [None]:
# print the scores on training and test set

print('Training set score: {:.4f}'.format(clf_en.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(clf_en.score(X_test, y_test)))

We can see that the training-set and test-set scores are the same as above. The training-set accuracy score is 0.7865, while the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting.


### Visualize decision-trees

In [None]:
from sklearn import tree

plt.figure(figsize=(12,8))

# plot the tree fitted above; there is no need to call fit() again
tree.plot_tree(clf_en)

Based on the above analysis, both trees reach a test-set accuracy of about 0.80, so the model predicts the class labels reasonably well.


However, accuracy alone does not reveal the underlying distribution of class labels. It also tells us nothing about the types of errors our classifier is making.
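One simple way to put an accuracy figure in context is to compare it with the share of the most frequent class, which is the accuracy of always predicting that class. The sketch below uses hypothetical label counts; on the real data the same calculation would be applied to `y_test`.

```python
import pandas as pd

# Hypothetical test labels with one dominant class
y_demo = pd.Series(['unacc'] * 70 + ['acc'] * 22 + ['good'] * 4 + ['vgood'] * 4)

class_shares = y_demo.value_counts(normalize=True)

# Accuracy obtained by always predicting the most frequent class
majority_baseline = class_shares.max()

print(class_shares)
print('Majority-class baseline accuracy: {:.4f}'.format(majority_baseline))
```

A model is only informative to the extent that its accuracy exceeds this baseline.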


We have another tool called `Confusion matrix` that comes to our rescue.

# 9. Confusion matrix

A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.


Four types of outcomes are possible while evaluating a classification model's performance. These four outcomes are described below:


True Positives (TP) – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.


True Negatives (TN) – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.


False Positives (FP) – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called Type I error.



False Negatives (FN) – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This is a very serious error and it is called Type II error.



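For a binary problem, these four outcomes fill a 2×2 confusion matrix. With a multi-class target such as `class`, the matrix has one row and one column per class, and the four outcomes are counted for each class in turn. The confusion matrix for the entropy model is given below.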


In [None]:
# Print the confusion matrix (rows: true classes, columns: predicted classes)

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_en)

print('Confusion matrix\n\n', cm)
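The per-class TP, FP, FN and TN counts described above can be read off a multi-class confusion matrix. The sketch below does this for a small set of hypothetical labels; passing `labels=` fixes the row and column order so the matrix can be labelled.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes
labels = ['unacc', 'acc', 'good']
y_true_demo = ['unacc', 'unacc', 'unacc', 'acc', 'acc', 'good', 'good', 'unacc']
y_pred_demo = ['unacc', 'unacc', 'acc', 'acc', 'unacc', 'good', 'acc', 'unacc']

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(pd.DataFrame(cm_demo, index=labels, columns=labels))

tp = np.diag(cm_demo)              # predicted as the class and truly the class
fp = cm_demo.sum(axis=0) - tp      # predicted as the class, but another class
fn = cm_demo.sum(axis=1) - tp      # truly the class, but predicted as another
tn = cm_demo.sum() - tp - fp - fn  # everything else

per_class = pd.DataFrame({'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn}, index=labels)
print(per_class)
```

Each class is treated in a one-vs-rest fashion, so the four counts in every row of `per_class` add up to the total number of observations.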



# 10. Classification Report

A classification report is another way to evaluate classification model performance. It displays the precision, recall, F1 and support scores for each class. I describe these terms later.

We can print a classification report as follows:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_en))
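The report can also be returned as a dictionary with `output_dict=True`, which makes individual scores easy to access. The sketch below uses hypothetical labels and checks one class by hand: precision is TP / (TP + FP) and recall is TP / (TP + FN).

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels for three classes
y_true_demo = ['unacc', 'unacc', 'unacc', 'acc', 'acc', 'good', 'good', 'unacc']
y_pred_demo = ['unacc', 'unacc', 'acc', 'acc', 'unacc', 'good', 'acc', 'unacc']

report = classification_report(y_true_demo, y_pred_demo, output_dict=True)

# For 'acc': 1 correct prediction out of 3 predicted 'acc' and 2 true 'acc'
manual_precision_acc = 1 / 3
manual_recall_acc = 1 / 2

print(report['acc']['precision'], manual_precision_acc)
print(report['acc']['recall'], manual_recall_acc)
print(report['acc']['support'])  # number of true 'acc' observations
```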