# Machine Learning

Using the dataset from https://archive.ics.uci.edu/ml/datasets/Car+Evaluation, create a machine learning model to predict the buying price given the following parameters:

**Attributes**  
buying:   vhigh, high, med, low  
maint:    vhigh, high, med, low  
doors:    2, 3, 4, 5more  
persons:  2, 4, more  
lug_boot: small, med, big  
safety:   low, med, high  

## Data exploration

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('car.data', header=None)

In [3]:
# Add headers to the dataframe
df.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
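
The two steps above (reading the file, then assigning headers) can also be done in one call. The following is a minimal sketch, assuming the same column names; the in-memory `sample` text is a hypothetical stand-in for `car.data` so the snippet runs on its own.

```python
import io
import pandas as pd

COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# Hypothetical two-row stand-in for car.data (same comma-separated layout, no header row).
sample = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,low,5more,more,big,high,vgood\n"
)

# names= assigns the headers while reading; dtype=str keeps values such as "2" as text,
# so every column is treated as categorical.
cars = pd.read_csv(sample, header=None, names=COLUMNS, dtype=str)
print(cars)
```

With the real file, `sample` would be replaced by the path `'car.data'`.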

In [4]:
df.head()

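|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|--------|-------|-------|---------|----------|--------|-------|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |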


In [5]:
df.shape

(1728, 7)

In [6]:
# Check for nulls
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   class     1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB
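
`df.info()` confirms there are no nulls, but it does not catch misspelled category values. The sketch below is one possible check, assuming the allowed values are exactly those in the attribute list at the top of this notebook plus the four `class` labels; `find_problems` and the `demo` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

# Allowed values, copied from the attribute list and the class labels in this notebook.
EXPECTED_VALUES = {
    "buying": {"vhigh", "high", "med", "low"},
    "maint": {"vhigh", "high", "med", "low"},
    "doors": {"2", "3", "4", "5more"},
    "persons": {"2", "4", "more"},
    "lug_boot": {"small", "med", "big"},
    "safety": {"low", "med", "high"},
    "class": {"unacc", "acc", "good", "vgood"},
}

def find_problems(frame):
    """Return {column: details} for columns with missing or unexpected values."""
    problems = {}
    for column, allowed in EXPECTED_VALUES.items():
        missing = int(frame[column].isna().sum())
        unexpected = set(frame[column].dropna().unique()) - allowed
        if missing or unexpected:
            problems[column] = {"missing": missing, "unexpected": unexpected}
    return problems

# Hypothetical two-row frame with one typo ("hihg") and one missing value.
demo = pd.DataFrame({
    "buying": ["vhigh", "hihg"],
    "maint": ["low", "med"],
    "doors": ["2", "5more"],
    "persons": ["4", None],
    "lug_boot": ["small", "big"],
    "safety": ["low", "high"],
    "class": ["unacc", "acc"],
})
print(find_problems(demo))
```

On the real dataframe, an empty dictionary from `find_problems(df)` would indicate that every column contains only the documented values.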


In [7]:
# Count how often each value occurs in every column
df.apply(lambda x: x.value_counts()).T.stack()

buying    high      432.0
          low       432.0
          med       432.0
          vhigh     432.0
maint     high      432.0
          low       432.0
          med       432.0
          vhigh     432.0
doors     2         432.0
          3         432.0
          4         432.0
          5more     432.0
persons   2         576.0
          4         576.0
          more      576.0
lug_boot  big       576.0
          med       576.0
          small     576.0
safety    high      576.0
          low       576.0
          med       576.0
class     acc       384.0
          good       69.0
          unacc    1210.0
          vgood      65.0
dtype: float64
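
The counts above show that the six attributes are perfectly balanced while `class` is not. A short sketch of what that imbalance means for evaluation, using only the four `class` counts printed above (the variable names are illustrative):

```python
import pandas as pd

# Class counts copied from the value_counts output above.
class_counts = pd.Series({"unacc": 1210, "acc": 384, "good": 69, "vgood": 65})

# Share of each class in the full dataset.
class_shares = class_counts / class_counts.sum()
print(class_shares.round(3))

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = class_shares.max()
print("Majority-class baseline:", round(majority_baseline, 3))
```

Any accuracy figure for a model that predicts `class` is best read against this baseline rather than against zero.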

## Feature encoding and train/test split

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import OrdinalEncoder

In [9]:
x = df.drop(['class'], axis=1)
y = df['class']
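
Note that the cell above uses `class` as the target, whereas the task statement asks for `buying`. A minimal sketch of the alternative split is shown here, assuming `class` is then kept as an input feature; the three-row `sample` frame and the `x_buying` / `y_buying` names are hypothetical. The results later in this notebook were produced with `class` as the target, not with this split.

```python
import pandas as pd

COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# Hypothetical three-row stand-in for the full dataframe.
sample = pd.DataFrame(
    [
        ["vhigh", "vhigh", "2", "2", "small", "low", "unacc"],
        ["med", "low", "4", "4", "big", "high", "vgood"],
        ["low", "med", "5more", "more", "med", "med", "acc"],
    ],
    columns=COLUMNS,
)

# Target: the buying price. Features: everything else, including the class label.
x_buying = sample.drop(columns=["buying"])
y_buying = sample["buying"]

print(x_buying.columns.tolist())
print(y_buying.tolist())
```

Because the attribute counts are perfectly balanced and 4 × 4 × 4 × 3 × 3 × 3 = 1728, the data looks like a full factorial design. If that is the case, the other five attributes alone would carry no information about `buying`, and `class` would be the main source of signal.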

In [10]:
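# Encode the categorical features as integers, since scikit-learn trees need numeric input.
# OrdinalEncoder sorts categories alphabetically by default, so the codes do not
# follow the natural order (e.g. low < med < high).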
enc = OrdinalEncoder()
x = enc.fit_transform(x)

In [11]:
x

array([[3., 3., 0., 0., 2., 1.],
       [3., 3., 0., 0., 2., 2.],
       [3., 3., 0., 0., 2., 0.],
       ...,
       [1., 1., 3., 2., 0., 1.],
       [1., 1., 3., 2., 0., 2.],
       [1., 1., 3., 2., 0., 0.]])

In [12]:
enc.categories_

[array(['high', 'low', 'med', 'vhigh'], dtype=object),
 array(['high', 'low', 'med', 'vhigh'], dtype=object),
 array(['2', '3', '4', '5more'], dtype=object),
 array(['2', '4', 'more'], dtype=object),
 array(['big', 'med', 'small'], dtype=object),
 array(['high', 'low', 'med'], dtype=object)]
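
The `enc.categories_` output above shows the default alphabetical ordering, where for example `high` is encoded as 0 and `low` as 1. `OrdinalEncoder` accepts a `categories` argument to fix the order explicitly. The sketch below assumes the natural order of each attribute as listed at the top of this notebook; the `FEATURE_ORDER` and `sample` names are illustrative, and whether this ordering changes the decision tree's accuracy is not measured here.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

FEATURES = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]

# One list per feature, from lowest to highest (assumed natural order).
FEATURE_ORDER = [
    ["low", "med", "high", "vhigh"],  # buying
    ["low", "med", "high", "vhigh"],  # maint
    ["2", "3", "4", "5more"],         # doors
    ["2", "4", "more"],               # persons
    ["small", "med", "big"],          # lug_boot
    ["low", "med", "high"],           # safety
]

# Hypothetical two-row sample with object dtype, as in the df.info() output.
sample = pd.DataFrame(
    [
        ["vhigh", "vhigh", "2", "2", "small", "low"],
        ["low", "med", "5more", "more", "big", "high"],
    ],
    columns=FEATURES,
    dtype=object,
)

ordered_enc = OrdinalEncoder(categories=FEATURE_ORDER)
encoded = ordered_enc.fit_transform(sample)
print(encoded)
```

With this ordering, larger codes consistently mean "more" of the attribute, which makes threshold splits such as `safety <= 0.5` easier to interpret.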

In [13]:
# Hold out 30% of the rows for testing (70% train, 30% test)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

In [14]:
x_train.shape, x_test.shape

((1209, 6), (519, 6))

In [15]:
y_train.shape, y_test.shape

((1209,), (519,))
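
The split above is purely random, so the rare `good` and `vgood` classes may be unevenly represented in the test set. A hedged sketch of a stratified split follows, using `train_test_split`'s `stratify` argument. The labels are rebuilt from the class counts shown earlier, and the all-zero `features` array is only a placeholder so the snippet runs without the data file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts in the data exploration section.
class_counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}
labels = np.repeat(list(class_counts.keys()), list(class_counts.values()))

# Placeholder features: only the shape matters for this illustration.
features = np.zeros((labels.size, 6))

# stratify=labels keeps the class proportions (almost) identical in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=1, stratify=labels
)

for name in class_counts:
    print(name, int((y_tr == name).sum()), int((y_te == name).sum()))
```

In the notebook itself this would amount to adding `stratify=y` to the existing call; the accuracy reported below was obtained without it.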

## Decision Tree Model

In [16]:
clf = DecisionTreeClassifier()

In [17]:
clf = clf.fit(x_train, y_train)
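
`DecisionTreeClassifier()` is created without a `random_state`, so repeated runs may build slightly different trees when several splits are equally good. The sketch below shows how fixing the seed makes the fitted tree repeatable. It uses synthetic data from `make_classification` as a stand-in for the car features, and the seed value 1 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 6 features and 4 classes, like the car dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=4, random_state=0
)

# Two trees built with the same seed.
tree_a = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)
tree_b = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

# Compare the structure of the two trees node by node.
same_structure = (
    tree_a.tree_.node_count == tree_b.tree_.node_count
    and np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature)
    and np.array_equal(tree_a.tree_.threshold, tree_b.tree_.threshold)
)
print("Identical trees:", same_structure)
```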

In [18]:
y_pred = clf.predict(x_test)

In [19]:
# Model accuracy: the fraction of test samples that were classified correctly
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9691714836223507
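
A single accuracy number can hide weak performance on the rare classes, since `unacc` alone makes up 1210 of the 1728 rows. A confusion matrix and a per-class report give a fuller picture. The sketch below uses small hand-written label lists (`y_true_demo`, `y_pred_demo`) purely for illustration; they are not the notebook's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

CLASS_LABELS = ["unacc", "acc", "good", "vgood"]

# Hand-written illustrative labels, not results from the model above.
y_true_demo = ["unacc"] * 6 + ["acc"] * 3 + ["good"] * 2 + ["vgood"]
y_pred_demo = ["unacc"] * 6 + ["acc", "acc", "unacc"] + ["good", "acc"] + ["vgood"]

# Rows are true classes, columns are predicted classes, both in CLASS_LABELS order.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=CLASS_LABELS)
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_true_demo, y_pred_demo, labels=CLASS_LABELS, zero_division=0))
```

In the notebook, the same two calls would take `y_test` and `y_pred` in place of the demo lists.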
