<a href="https://colab.research.google.com/github/iris2311/car-quality/blob/main/car_quality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, we import the libraries used throughout this project.

In [212]:
!pip install matplotlib
!pip install ucimlrepo
import numpy
import matplotlib.pyplot
import pandas



In [213]:
from ucimlrepo import fetch_ucirepo
                                        # loading code taken from the page that hosts the dataset
car_evaluation = fetch_ucirepo(id=19)    # fetch the Car Evaluation dataset (UCI id 19)

X = car_evaluation.data.features         # feature columns
y = car_evaluation.data.targets          # target column ('class')

print(car_evaluation.metadata)

print(car_evaluation.variables)

{'uci_id': 19, 'name': 'Car Evaluation', 'repository_url': 'https://archive.ics.uci.edu/dataset/19/car+evaluation', 'data_url': 'https://archive.ics.uci.edu/static/public/19/data.csv', 'abstract': 'Derived from simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods.', 'area': 'Other', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 1728, 'num_features': 6, 'feature_types': ['Categorical'], 'demographics': [], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1988, 'last_updated': 'Thu Aug 10 2023', 'dataset_doi': '10.24432/C5JP48', 'creators': ['Marko Bohanec'], 'intro_paper': {'title': 'Knowledge acquisition and explanation for multi-attribute decision making', 'authors': 'M. Bohanec, V. Rajkovič', 'published_in': '8th Intl Workshop on Expert Systems and their Applications, Avignon, France', 'yea

From the output above, we can see that all variables are categorical and that there are no missing values. In this project we will split the data into a training set (80%) and a test set (20%) using the train_test_split function from the sklearn library.


The decision tree implementation in sklearn works with numerical data, so we need to encode the categorical variables that describe car quality. We do this before splitting the data into training and test sets, so that both sets are encoded with the same columns.

In [214]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
x_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])], remainder='passthrough')
y_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), ['class'])], remainder='passthrough')
X=x_transformer.fit_transform(X).toarray()
y=y_transformer.fit_transform(y).toarray()


#each entry in the transformers list of ColumnTransformer is a (name, transformer, columns) tuple:
#the first element is a label for the step
#the second is the encoder to apply
#the third is the list of columns on which the transformation is made
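To make the encoding step concrete, here is a minimal sketch of one-hot encoding on a toy column. The `toy` DataFrame and its values are invented for illustration and are not part of the dataset pipeline above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column, used only to illustrate the encoding.
toy = pd.DataFrame({"safety": ["low", "med", "high", "low"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy).toarray()   # sparse matrix converted to a dense array

print(encoder.categories_)   # categories learned for the column, in sorted order
print(encoded)               # one column per category, exactly one 1 in each row
```

Each category becomes its own 0/1 column, which is why the six categorical features above expand into a wider numerical matrix.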

In [215]:
from sklearn.model_selection import train_test_split

#X_train and Y_train form the training set
#X_test and Y_test form the test set

X_train, X_test, Y_train, Y_test = train_test_split(X,y,test_size=0.2, random_state=1)    #test_size is 0.2 because the test set holds 20% of the rows
                                                                                          #random_state=1 fixes the seed of the shuffle, so the split is reproducible
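The effect of `random_state` can be checked on a small array. This is an illustrative sketch; the array `data` is a stand-in and not the car dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)   # stand-in for the rows of a dataset

# Two calls with the same seed produce the same shuffled split.
a_train, a_test = train_test_split(data, test_size=0.2, random_state=1)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=1)

same_split = np.array_equal(a_test, b_test)
print(len(a_train), len(a_test), same_split)
```

The data is shuffled by default in either case; fixing the seed only makes the shuffle repeatable between runs.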

Printing train set:

In [216]:
print(X_train)
print(Y_train)

[[0. 1. 0. ... 0. 0. 1.]
 [1. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 1. 0. 0.]
 ...
 [0. 0. 1. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 1. ... 1. 0. 0.]]
[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 ...
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]


Printing test set:

In [217]:
print(X_test)
print(Y_test)

[[0. 0. 1. ... 0. 1. 0.]
 [1. 0. 0. ... 0. 0. 1.]
 [1. 0. 0. ... 0. 0. 1.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
[[0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 ...
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]


Because there is no missing data, the data preprocessing phase ends here. Decision trees and random forests are not sensitive to feature scaling because their splits do not change under any monotonic transformation of a feature, so we do not need to apply feature scaling (https://forecastegy.com/posts/do-decision-trees-need-feature-scaling-or-normalization/).
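The scaling claim can be checked with a small experiment. The sketch below uses synthetic integer features (an assumption made for illustration, unrelated to the car dataset), fits one tree on the raw values and one on standardized values, and compares their predictions on the training rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
features = rng.integers(1, 50, size=(200, 3)).astype(float)    # synthetic data
labels = (features[:, 0] + features[:, 1] > 50).astype(int)

# Standardization keeps the ordering of values within each feature.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(features, labels)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(scaled, labels)

same_predictions = np.array_equal(tree_raw.predict(features), tree_scaled.predict(scaled))
print(same_predictions)
```

A split only depends on how the samples are ordered along a feature, so both trees should partition the training rows in the same way.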

In [218]:
from sklearn.tree import DecisionTreeClassifier     #Decision tree is already implemented in the sklearn library
classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)
                                                      #criterion is the function used to measure the quality of a split

classifier.fit(X_train, Y_train)  #we have to train our model on the train set data

y_pred = classifier.predict(X_test)    #we will store predictions for Y_test in y_pred
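To illustrate what the "entropy" criterion measures, here is a minimal sketch of Shannon entropy and the information gain of a single split. The class counts are made up for the example, and the helper `entropy` is our own function, not part of sklearn.

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy (base 2) of a node, given the number of samples per class."""
    counts = np.asarray(class_counts, dtype=float)
    probs = counts[counts > 0] / counts.sum()   # skip empty classes to avoid log(0)
    return float(-(probs * np.log2(probs)).sum())

# Hypothetical node with 10 samples of each class, split into two children.
parent = [10, 10]
left, right = [8, 2], [2, 8]

n = sum(parent)
weighted_child = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
information_gain = entropy(parent) - weighted_child

print(entropy(parent), information_gain)
```

A split is better the more it reduces the weighted entropy of the children relative to the parent node.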


To evaluate the Decision Tree classifier on our test set, we will use a confusion matrix and the accuracy_score function.


Confusion Matrix and Accuracy score for Decision Tree algorithm:

In [219]:
from sklearn.metrics import confusion_matrix, accuracy_score
confusion_m = confusion_matrix(Y_test.argmax(axis=1), y_pred.argmax(axis=1))
print(confusion_m)
accuracy_score(Y_test, y_pred)

[[ 54   0   4   1]
 [  1  12   1   0]
 [  6   0 254   0]
 [  1   1   0  11]]


0.9566473988439307
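The accuracy can also be read directly off the confusion matrix: correct predictions lie on the diagonal. A short check, using the matrix printed above:

```python
import numpy as np

# Confusion matrix copied from the Decision Tree output above.
confusion = np.array([
    [54,  0,   4,  1],
    [ 1, 12,   1,  0],
    [ 6,  0, 254,  0],
    [ 1,  1,   0, 11],
])

correct = np.trace(confusion)    # sum of the diagonal entries
total = confusion.sum()          # number of test samples
accuracy = correct / total

print(correct, total, accuracy)
```

For this matrix, the diagonal holds 331 of the 346 test samples, which agrees with the accuracy score reported above.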

Now we will run the Random Forest classification algorithm on our dataset.

In [236]:
from sklearn.ensemble import RandomForestClassifier   #we have to import the classifier
random_forest = RandomForestClassifier(n_estimators=800, criterion ='entropy', random_state=0)
random_forest.fit(X_train, Y_train)   #fitting model on train set

#n_estimators is the number of decision trees that are built

Predicting the test set results:

In [237]:
y_pred_random_forest = random_forest.predict(X_test)

Confusion Matrix and Accuracy score for Random Forest algorithm:

In [238]:
confusion_m_random = confusion_matrix(Y_test.argmax(axis=1), y_pred_random_forest.argmax(axis=1))
print(confusion_m_random)
accuracy_score(Y_test, y_pred_random_forest)

[[ 57   1   0   1]
 [  3  11   0   0]
 [ 11   0 249   0]
 [  3   0   0  10]]


0.9393063583815029
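Note that the diagonal of the Random Forest confusion matrix sums to 327 of 346 samples (about 0.945), while the reported accuracy is 325 of 346. One plausible explanation, which we have not verified on this run, is that the targets are one-hot encoded: accuracy_score then requires an exact row match, while argmax maps a row with no predicted class to class 0. The toy arrays below are invented to show that effect.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical one-hot targets and predictions; the first prediction has no class set.
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0]])
y_hat = np.array([[0, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0]])

# On one-hot arrays, accuracy_score counts a row as correct only if every entry matches.
subset_acc = accuracy_score(y_true, y_hat)

# argmax of an all-zero row is 0, so that row is counted as a class 0 prediction.
argmax_acc = accuracy_score(y_true.argmax(axis=1), y_hat.argmax(axis=1))

print(subset_acc, argmax_acc)
```

Fitting the classifiers on a one-dimensional label vector instead of a one-hot matrix would make the two measures coincide.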

XGBoost (using XGBRFClassifier, the library's random forest variant):

In [248]:
from sklearn.metrics import classification_report
from xgboost import XGBRFClassifier

xgboost_classifier = XGBRFClassifier()

# Fit the classifier to the training data
xgboost_classifier.fit(X_train, Y_train)
y_xgboost = xgboost_classifier.predict(X_test)   # predict with the classifier fitted above

accuracy_score(Y_test, y_xgboost)

0.9161849710982659