<a href="https://colab.research.google.com/github/iris2311/car-quality/blob/encoding/car_quality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, we import the libraries we will use throughout this project.

In [68]:
!pip install matplotlib
!pip install ucimlrepo
import numpy
import matplotlib.pyplot
import pandas



In [69]:
from ucimlrepo import fetch_ucirepo
                                         # this snippet was found on the same page as the dataset
car_evaluation = fetch_ucirepo(id=19)    # fetch the Car Evaluation dataset (UCI id 19)

X = car_evaluation.data.features         # feature columns
y = car_evaluation.data.targets          # target column ('class')

print(car_evaluation.metadata)

print(car_evaluation.variables)

{'uci_id': 19, 'name': 'Car Evaluation', 'repository_url': 'https://archive.ics.uci.edu/dataset/19/car+evaluation', 'data_url': 'https://archive.ics.uci.edu/static/public/19/data.csv', 'abstract': 'Derived from simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods.', 'area': 'Other', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 1728, 'num_features': 6, 'feature_types': ['Categorical'], 'demographics': [], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1988, 'last_updated': 'Thu Aug 10 2023', 'dataset_doi': '10.24432/C5JP48', 'creators': ['Marko Bohanec'], 'intro_paper': {'title': 'Knowledge acquisition and explanation for multi-attribute decision making', 'authors': 'M. Bohanec, V. Rajkovič', 'published_in': '8th Intl Workshop on Expert Systems and their Applications, Avignon, France', 'yea

From the code output above, we can see that all variables are categorical and that there are no missing values. In this project we will split the data into a train set (80%) and a test set (20%) using the `train_test_split` function from the sklearn library.


The decision tree algorithm in sklearn works with numerical data, so we need to encode the categorical variables, including the class label that represents car quality. It is important that we do this before splitting the data into the train and test sets.

In [70]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
x_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'] )])
y_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])])
X=x_transformer.fit_transform(X).toarray()
y=y_transformer.fit_transform(y).toarray()


# each entry in `transformers` is a (name, transformer, columns) tuple:
# the first element is a name we choose for the step,
# the second is the transformer to apply (here, OneHotEncoder),
# the third lists the columns the transformer is applied to
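To make the encoding step concrete, here is a minimal pure-Python sketch of what one-hot encoding does to a single categorical column. The helper `one_hot` and the sample values are assumptions made for illustration; this is not how `OneHotEncoder` is implemented, and it ignores everything the real encoder handles beyond the basic idea.

```python
def one_hot(values):
    """One-hot encode a list of category labels (hypothetical helper)."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        row[index[value]] = 1.0  # exactly one column is "hot" per row
        rows.append(row)
    return categories, rows


# Sample values in the style of the 'safety' column (illustrative only).
safety = ["low", "med", "high", "med"]
categories, encoded = one_hot(safety)
print(categories)
for row in encoded:
    print(row)
```

Each row contains a single 1.0, which is the same pattern visible in the encoded arrays printed later in the notebook.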

In [71]:
from sklearn.model_selection import train_test_split

# X_train and Y_train form the train set
# X_test and Y_test form the test set

X_train, X_test, Y_train, Y_test = train_test_split(X,y,test_size=0.2, random_state=1)    # test_size=0.2 puts 20% of the rows in the test set
                                                                                          # the data is shuffled by default; random_state=1 fixes the seed so the split is reproducible
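The effect of `test_size` and `random_state` can be sketched in plain Python. This is an illustration under assumptions, not sklearn's implementation: the helper `seeded_split` is hypothetical, it uses Python's `random` module, and the rows it selects will differ from those chosen by `train_test_split`. Rounding the test count up reflects our understanding of how a fractional `test_size` is handled and should be treated as an assumption.

```python
import math
import random


def seeded_split(n_samples, test_size, seed):
    """Return (train_indices, test_indices) from a reproducible shuffle.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # same seed -> same order
    n_test = math.ceil(test_size * n_samples)   # assumption: round the test count up
    return indices[n_test:], indices[:n_test]


# The Car Evaluation dataset has 1728 rows.
train_idx, test_idx = seeded_split(1728, 0.2, seed=1)
print(len(train_idx), len(test_idx))

# Calling again with the same seed reproduces the split exactly.
assert seeded_split(1728, 0.2, seed=1) == (train_idx, test_idx)
```

With 1728 rows this gives 1382 train rows and 346 test rows. Changing the seed changes which rows land in each set, but not the sizes.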

Printing train set:

In [72]:
print(X_train)
print(Y_train)

[[0. 1. 0. ... 0. 0. 1.]
 [1. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 1. 0. 0.]
 ...
 [0. 0. 1. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 1. ... 1. 0. 0.]]
[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 ...
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]


Printing test set:

In [73]:
print(X_test)
print(Y_test)

[[0. 0. 1. ... 0. 1. 0.]
 [1. 0. 0. ... 0. 0. 1.]
 [1. 0. 0. ... 0. 0. 1.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
[[0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 ...
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]


Because there are no missing values, this is the end of the data preprocessing phase. Decision trees and random forests are not sensitive to feature scaling, because their splits do not change under any monotonic transformation of a feature, so we don't need to apply feature scaling (https://forecastegy.com/posts/do-decision-trees-need-feature-scaling-or-normalization/).
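The claim about monotonic transformations can be checked with a small sketch. It assumes a single numeric feature with made-up values and a hypothetical helper `best_partition` that picks the threshold split with the lowest weighted entropy; it is not how sklearn searches for splits internally. Applying `math.log`, a strictly increasing function, changes the threshold value but not which samples land on each side.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_partition(values, labels):
    """Return the set of sample indices on the left side of the best split.

    Hypothetical helper: tries a cut between each pair of adjacent sorted
    values and keeps the one with the lowest weighted entropy.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float("inf"), None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        score = (len(left) * entropy([labels[i] for i in left])
                 + len(right) * entropy([labels[i] for i in right])) / len(order)
        if score < best_score:
            best_score, best_left = score, set(left)
    return best_left


# Made-up feature values and labels, for illustration only.
feature = [1.0, 2.0, 10.0, 200.0, 3000.0, 40000.0]
labels = ["unacc", "unacc", "unacc", "acc", "acc", "acc"]

original = best_partition(feature, labels)
rescaled = best_partition([math.log(v) for v in feature], labels)
print(original == rescaled)
```

Both calls select the same three samples for the left branch, because the transformation preserves the order of the feature values and the split search depends only on that order.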

In [None]:
from sklearn.tree import DecisionTreeClassifier     # the decision tree is already implemented in the sklearn library
classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)
                                                      # criterion is the function used to measure the quality of a split

classifier.fit(X_train, Y_train)  # train the model on the train set data
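For reference, the `entropy` criterion is based on Shannon entropy. The standard definitions of entropy and information gain are given below, followed by a small worked example with made-up class counts. The notation ($S$ for a set of samples, $p_k$ for the share of class $k$, $S_j$ for the child nodes, $IG$ for information gain) is ours and is not taken from the sklearn documentation.

```latex
H(S) = -\sum_{k} p_k \log_2 p_k,
\qquad
IG(S, \text{split}) = H(S) - \sum_{j} \frac{|S_j|}{|S|} H(S_j)

% Worked example with made-up counts: 8 samples, 4 per class, so H(S) = 1 bit.
% A split into children with class counts (3, 1) and (1, 3) gives each child
H(S_j) = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.811,
\qquad
IG \approx 1 - 0.811 = 0.189
```

A split that separated the two classes perfectly would leave both children with zero entropy and give the maximum gain of 1 bit for this example.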