## Section 5: Machine Learning
Using the dataset from https://archive.ics.uci.edu/ml/datasets/Car+Evaluation, create a machine learning model to predict the buying price given the following parameters:

Maintenance = High
Number of doors = 4
Lug Boot Size = Big
Safety = High
Class Value = Good


In [2]:
## Data Exploration

import pandas as pd

mydf = pd.read_csv('cardata.csv')

print(mydf.head(5))
mydf['doors'].describe()

   Price  Maint doors persons lug_boot safety  class
0  vhigh  vhigh     2       2    small    low  unacc
1  vhigh  vhigh     2       2    small    med  unacc
2  vhigh  vhigh     2       2    small   high  unacc
3  vhigh  vhigh     2       2      med    low  unacc
4  vhigh  vhigh     2       2      med    med  unacc


count     1728
unique       4
top          2
freq       432
Name: doors, dtype: object

The data is entirely categorical.

One-hot encode the features and fit a simple logistic regression, with `class` as the target, as a first pass.
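To make the encoding step concrete, here is a minimal plain-Python sketch of what one-hot encoding produces for a single column. The helper name `one_hot` is hypothetical and not part of scikit-learn; it sorts the distinct values, as `OneHotEncoder` does for its `categories_`, and emits one 0/1 indicator per category.

```python
def one_hot(values):
    """Return (categories, rows): sorted distinct values and one indicator row per input value."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows

# The three lug_boot levels seen in the dataset.
categories, rows = one_hot(["small", "med", "big", "small"])
print(categories)  # ['big', 'med', 'small']
print(rows[0])     # [0, 0, 1]
```

Applied to all five feature columns, the same idea yields one wide 0/1 matrix, which is what the logistic regression is fitted on.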

In [39]:
from sklearn.preprocessing import OneHotEncoder

# One-hot encode every feature; categories unseen at fit time encode as all zeros.
enc = OneHotEncoder(handle_unknown='ignore')
target = mydf['class']
# Features: every column except the target ('class') and 'Price'.
X = mydf.loc[:, ~mydf.columns.isin(['class', 'Price'])]

enc.fit(X)

# Dense 0/1 matrix with one column per (feature, category) pair.
independ_variables_OHE = enc.transform(X).toarray()

In [40]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    independ_variables_OHE, 
    target, 
    test_size=0.33, 
    random_state=42)

In [41]:
from sklearn.linear_model import LogisticRegression

# Create an instance of Logistic Regression Classifier and fit the data.
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)

print(logreg.score(X_test, y_test))

0.8266199649737302


### Observations from first round of data.

Plain logistic regression gives unimpressive accuracy (about 0.83). Co-occurrence between features needs exploring, and some feature extraction may help. Try decision trees?
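One way to put the 0.83 in context is to compare it with always predicting the most frequent class. The sketch below uses the `class` counts that `value_counts()` reports for this dataset; it is computed on all 1728 rows rather than on the held-out split, so treat it as an approximate reference point.

```python
# Class counts as reported by mydf['class'].value_counts() in this notebook.
class_counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always answers with the majority class.
baseline_accuracy = class_counts[majority_class] / total
print(total, majority_class, round(baseline_accuracy, 4))  # 1728 unacc 0.7002
```

Against this floor of roughly 0.70, the logistic regression is only about 13 percentage points better than a constant guess.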

In [42]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()

clf = clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))

0.7618213660245184


Worse (about 0.76). Explore the data in more detail.

In [45]:
for col in mydf.columns:
    print(mydf[col].value_counts())   

vhigh    432
high     432
med      432
low      432
Name: Price, dtype: int64
vhigh    432
high     432
med      432
low      432
Name: Maint, dtype: int64
2        432
3        432
4        432
5more    432
Name: doors, dtype: int64
2       576
4       576
more    576
Name: persons, dtype: int64
small    576
med      576
big      576
Name: lug_boot, dtype: int64
low     576
med     576
high    576
Name: safety, dtype: int64
unacc    1210
acc       384
good       69
vgood      65
Name: class, dtype: int64


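Every row is a unique combination of attribute values, and every input attribute is perfectly balanced. Since 4 × 4 × 4 × 3 × 3 × 3 = 1728, each combination of the six inputs appears exactly once, so `Price` is statistically independent of the other five inputs and they cannot predict it on their own. Only `class`, which is derived from all six inputs, can carry information about `Price`. Try a regression that uses `class` as the sole predictor.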
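The balanced counts are what a full factorial design produces. The sketch below rebuilds that grid from the levels listed in the `value_counts()` output; it does not read `cardata.csv`, and it assumes the file is exactly this grid, which the row count and the per-level counts are consistent with. It then checks that, once the other five inputs are fixed, every price level occurs exactly once.

```python
from collections import Counter, defaultdict
from itertools import product

# Attribute levels copied from the value_counts() output.
levels = {
    "Price": ["vhigh", "high", "med", "low"],
    "Maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Every combination of levels, one tuple per hypothetical row.
grid = list(product(*levels.values()))
print(len(grid))  # 1728

# Group by the five non-price inputs and count the prices seen in each group.
prices_by_inputs = defaultdict(Counter)
for price, *others in grid:
    prices_by_inputs[tuple(others)][price] += 1

# Every group holds each price level exactly once: a uniform distribution.
uniform = all(
    sorted(c.items()) == sorted((p, 1) for p in levels["Price"])
    for c in prices_by_inputs.values()
)
print(len(prices_by_inputs), uniform)  # 432 True
```

With a uniform price distribution inside every group, no classifier trained on those five inputs alone can beat the 0.25 chance level for four balanced price classes.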

In [49]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Swap roles: predict 'Price' from the one-hot encoded 'class' column only.
enc = OneHotEncoder(handle_unknown='ignore')
target = mydf['Price']
X = mydf.loc[:, ['class']]

enc.fit(X)

independ_variables_OHE = enc.transform(X).toarray()
print(enc.categories_)

X_train, X_test, y_train, y_test = train_test_split(
    independ_variables_OHE,
    target,
    test_size=0.33,
    random_state=42)

# A large C means very weak regularization.
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)

print(logreg.score(X_test, y_test))

[array(['acc', 'good', 'unacc', 'vgood'], dtype=object)]
[[0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 ...
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]]
0.318739054290718
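The notebook stops before producing a prediction for the requested car. As one possible way to finish, and not the author's method, the sketch below skips the model and tabulates how often each `Price` level occurs among the rows that match the stated attributes. The helpers `load_rows` and `price_distribution` are hypothetical names; column names follow the `head()` output above, and `persons` is left unconstrained because the task does not specify it.

```python
import csv
from collections import Counter

def load_rows(path):
    """Read the CSV into a list of dicts keyed by column name (all values as strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def price_distribution(rows, query, target="Price"):
    """Count the target levels among rows whose columns equal every value in query."""
    matches = [r for r in rows if all(r[k] == v for k, v in query.items())]
    return Counter(r[target] for r in matches)

# Attributes from the task statement, spelled as they appear in the data.
query = {"Maint": "high", "doors": "4", "lug_boot": "big",
         "safety": "high", "class": "good"}

# Usage, assuming cardata.csv is in the working directory:
# counts = price_distribution(load_rows("cardata.csv"), query)
# print(counts.most_common())
```

The most frequent level in the result is the natural prediction. An empty counter would mean no car in the data has that combination, in which case any model's answer for it is an extrapolation.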
