# Classifying cars

## Agenda
        1 - Data Exploration
        2 - Encoding features
                zip() Short Example - just see what is happening
        3 - Training the model
                training and test subsets Short Example - just see what is happening
            3.1 - Implementing K-Nearest Neighbors
        4 - Predictions vs Actual Values

In [58]:
# Data manipulation
import pandas as pd
# Numerical python
import numpy as np

# Machine Learning
import sklearn.model_selection  # loads the submodule so sklearn.model_selection.train_test_split resolves
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model, preprocessing

In [2]:
data = pd.read_csv('car.data')

In [6]:
data.head()

Unnamed: 0,buying,maint,door,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


## 1 - Data Exploration

In [7]:
# Number of (rows, columns)

data.shape

(1728, 7)

In [8]:
# Basic information on all columns (No Null values, this is good hehe)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   door      1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   class     1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


In [16]:
# Summary statistics. Every column has dtype object, so describe() reports count, unique, top and freq
# instead of numeric statistics (as it is a 'classifier' dataset we have few unique values)

data.describe()

Unnamed: 0,buying,maint,door,persons,lug_boot,safety,class
count,1728,1728,1728,1728,1728,1728,1728
unique,4,4,5,3,3,3,4
top,med,med,4,4,big,med,unacc
freq,432,432,432,576,576,576,1210


## 2 - Encoding features
        We want to avoid features with non-numerical data, because
        K-Nearest Neighbors computes distances between rows and
        distances are only defined for numbers. That means we have
        to convert the non-numerical values into numerical ones.

**This is where sklearn's preprocessing module comes in: its LabelEncoder class does the conversion for us**

In [21]:
# LabelEncoder() will be responsible for turning the labels into appropriate integer values

label_encoder = preprocessing.LabelEncoder()

**Encoding the levels of categorical features into numeric values**

In [25]:
# fit_transform() takes the values of one column and returns an array with the corresponding integer codes.
# Each call refits the encoder, so label_encoder ends up remembering only the last column ('class').

buying = label_encoder.fit_transform(list(data['buying']))
maint = label_encoder.fit_transform(list(data["maint"]))
door = label_encoder.fit_transform(list(data["door"]))
persons = label_encoder.fit_transform(list(data["persons"]))
lug_boot = label_encoder.fit_transform(list(data["lug_boot"]))
safety = label_encoder.fit_transform(list(data["safety"]))
class_ = label_encoder.fit_transform(list(data["class"]))

The result:

In [40]:
data.buying[[0,1,100,800,900,1400]]

0       vhigh
1       vhigh
100     vhigh
800      high
900       med
1400      low
Name: buying, dtype: object

In [38]:
# The same rows after encoding
buying[[0,1,100,800,900,1400]]

array([3, 3, 3, 0, 2, 1], dtype=int64)
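
The integers are not arbitrary: `LabelEncoder` sorts the distinct labels and numbers them in that order, which is why `high` becomes 0 and `vhigh` becomes 3. A minimal sketch with a hand-typed list (the names `sample_levels`, `demo_encoder` and `demo_mapping` are illustrative, not part of the notebook) makes the mapping visible through the `classes_` attribute:

```python
from sklearn.preprocessing import LabelEncoder

# Toy sample of the 'buying' levels, typed by hand for illustration
sample_levels = ["vhigh", "high", "med", "low", "vhigh"]

demo_encoder = LabelEncoder()
demo_codes = demo_encoder.fit_transform(sample_levels)

# classes_ holds the labels in sorted order; the position of a label is its code
demo_mapping = {str(label): code for code, label in enumerate(demo_encoder.classes_)}
print(demo_mapping)
print(demo_codes)

# inverse_transform turns codes back into the original labels
print(demo_encoder.inverse_transform([3, 0]))
```

Because the notebook reuses one `label_encoder` for every column, after the last call it only remembers the `class` labels; decoding a feature column this way would need a separate encoder per column.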

**Now we'll recombine our data into a feature list and a label list. We can use the zip() function to make things easier**

        The zip() function returns a zip object, an iterator of tuples in which
        the first items of all the passed iterables are paired together, then
        the second items are paired together, and so on.

### |----- _zip() Short Example - just see what is happening_ -----|

In [56]:
fruits = ("Apple", "Banana", "Lemon")
fruit_colors = ("Red", "Yellow", "Green")

zip_test = zip(fruits, fruit_colors)

print(tuple(zip_test)) # Here we need to cast it as a tuple in order to show the elements

(('Apple', 'Red'), ('Banana', 'Yellow'), ('Lemon', 'Green'))


**|----- _End of example [ \o/ ]_ ------|**

In [49]:
predict = 'class'

# Combining the encoded columns into a list of rows (one tuple per car), so we can apply ML methods on them
features_X = list(zip(buying, maint, door, persons, lug_boot, safety))
labels_y = list(class_)

In [52]:
# Just like in the example
features_X[0:4]

[(3, 3, 0, 0, 2, 1),
 (3, 3, 0, 0, 2, 2),
 (3, 3, 0, 0, 2, 0),
 (3, 3, 0, 0, 1, 1)]
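
One thing worth keeping in mind: K-Nearest Neighbors compares rows by distance, and the sorted codes (`high`=0, `low`=1, `med`=2, `vhigh`=3) do not follow the natural order of the levels. As a hedged alternative, and not what this notebook does, the levels could be mapped by hand so that larger numbers mean "more". The `level_order` dictionary and the toy DataFrame below are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical ordering chosen by hand; the notebook itself uses LabelEncoder
level_order = {"low": 0, "med": 1, "high": 2, "vhigh": 3}

# Tiny hand-typed frame standing in for two of the real columns
toy = pd.DataFrame({
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["low", "low", "vhigh", "med"],
})

# Map every column through the same ordered dictionary
ordinal = toy.apply(lambda column: column.map(level_order))

# Same row-wise layout as features_X: one tuple per car
ordinal_rows = list(zip(ordinal["buying"], ordinal["maint"]))
print(ordinal_rows)
```

Whether this changes the accuracy on the car data would have to be measured; the sketch only shows the mechanics.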

## 3 - Training the model 

In [59]:
'''  Splitting the features and labels into random train and test subsets '''

features_X_train, features_X_test, labels_y_train, labels_y_test = sklearn.model_selection.train_test_split(features_X,
                                                                                                           labels_y,
                                                                                                           test_size=0.1)
# 0.1 (10%) of the data is being allocated as test data while the other 90% is being treated as training data

**[features_X_train and labels_y_train] will be used to train our model**<br>
(and make the machine learn)

**[features_X_test and labels_y_test] will be used to test the accuracy of our model**<br>
(accuracy is the ratio of correct predictions to the total number of test samples)
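
The accuracy ratio can be checked by hand. The sketch below uses invented label arrays (`actual` and `predicted` are hypothetical, not taken from the car data) and compares the manual ratio with scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical encoded labels, invented for illustration
actual = np.array([2, 2, 0, 1, 2, 3])
predicted = np.array([2, 0, 0, 1, 2, 2])

# Fraction of positions where the prediction matches the actual label
manual_accuracy = np.mean(actual == predicted)

# The same number computed by scikit-learn
library_accuracy = accuracy_score(actual, predicted)

print(manual_accuracy, library_accuracy)
```

Four of the six predictions match, so both values come out at 4/6.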

### |----- _training and test subsets Short Example - just see what is happening_ -----|

In [60]:
''' Here are the values for X and y '''
X , y = np.arange(10).reshape((5, 2)), np.arange(5)
print('X:\n',X)
print('y:\n',y)

X:
 [[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]
y:
 [0 1 2 3 4]


In [61]:
''' What we are doing '''

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)

print('Random X_train values:\n',X_train)
print('\nRandom y_train values:\n',y_train)
print('\nRandom X_test values:\n',X_test)
print('\nRandom y_test values:\n',y_test)

Random X_train values:
 [[0 1]
 [4 5]
 [8 9]
 [6 7]]

Random y_train values:
 [0 2 4 3]

Random X_test values:
 [[2 3]]

Random y_test values:
 [1]


#### |----- _End of example [ \o/ ]_ -----|
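
Since the split is random, every run produces different subsets and therefore slightly different accuracies. A minimal sketch, assuming reproducibility is wanted: `train_test_split` accepts a `random_state` seed, and a `stratify` argument that keeps the class proportions equal in both subsets. The arrays `X_demo` and `y_demo` are made up for the demonstration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, two balanced classes
X_demo = np.arange(40).reshape((20, 2))
y_demo = np.array([0] * 10 + [1] * 10)

# Same seed twice: the two splits should be identical
split_a = train_test_split(X_demo, y_demo, test_size=0.2,
                           random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2,
                           random_state=42, stratify=y_demo)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_tr, X_te, y_tr, y_te = split_a
print('Identical splits:', same_split)
print('Train shape:', X_tr.shape, 'Test shape:', X_te.shape)
print('Test samples per class:', np.bincount(y_te))
```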

### 3.1 - Implementing K-Nearest Neighbors

In [97]:
# KNeighborsClassifier() accepts several parameters; here we only set n_neighbors (the number of neighbors that vote)

classifier_model = KNeighborsClassifier(n_neighbors=5)

In [98]:
# Training the model

classifier_model.fit(features_X_train, labels_y_train)

# Using the test subset to validate the model
accuracy = classifier_model.score(features_X_test, labels_y_test)
print(accuracy)

0.9248554913294798
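
To see what the classifier is doing conceptually, here is a minimal from-scratch sketch, assuming Euclidean distance and a plain majority vote (the defaults of `KNeighborsClassifier`). The function `knn_predict` and the toy points are hypothetical; scikit-learn's real implementation is considerably more elaborate:

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(np.asarray(train_points) - np.asarray(query), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most frequent one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters, invented for illustration
toy_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_labels = [0, 0, 0, 1, 1, 1]

near_first_cluster = knn_predict(toy_points, toy_labels, (1, 1))
near_second_cluster = knn_predict(toy_points, toy_labels, (5, 4))
print(near_first_cluster, near_second_cluster)
```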


**Now let's change the number of neighbors and see how this affects the accuracy**

In [91]:
classifier_model_3 = KNeighborsClassifier(n_neighbors=3)

classifier_model_3.fit(features_X_train, labels_y_train)
accuracy_3 = classifier_model_3.score(features_X_test, labels_y_test)
print(accuracy_3)

0.8497109826589595


In [92]:
classifier_model_7 = KNeighborsClassifier(n_neighbors=7)

classifier_model_7.fit(features_X_train, labels_y_train)
accuracy_7 = classifier_model_7.score(features_X_test, labels_y_test)
print(accuracy_7)

0.9479768786127167


In [93]:
classifier_model_9 = KNeighborsClassifier(n_neighbors=9)

classifier_model_9.fit(features_X_train, labels_y_train)
accuracy_9 = classifier_model_9.score(features_X_test, labels_y_test)
print(accuracy_9)

0.953757225433526


In [95]:
classifier_model_15 = KNeighborsClassifier(n_neighbors=15)

classifier_model_15.fit(features_X_train, labels_y_train)
accuracy_15 = classifier_model_15.score(features_X_test, labels_y_test)
print(accuracy_15)

0.8728323699421965


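**On this split, accuracy peaks around 7 to 9 neighbors: too few neighbors and the accuracy goes down, too many neighbors and it goes down again**

The split is random, so the exact numbers change from run to run.
   
    Neighbors   Accuracy
        3   --->  0.85
        5   --->  0.92
        7   --->  0.95
        9   --->  0.95
        15  --->  0.87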
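
A single random split gives noisy numbers, so comparing values of `n_neighbors` on it can be misleading. One possible way to get steadier estimates is k-fold cross-validation with `cross_val_score`. The sketch below runs on synthetic data from `make_classification`, used here only as a stand-in so the block is self-contained; in the notebook the inputs would be `features_X` and `labels_y`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 300 rows, 6 features, 3 classes
X_synth, y_synth = make_classification(n_samples=300, n_features=6,
                                       n_informative=4, n_classes=3,
                                       random_state=0)

mean_scores = {}
for k in (3, 5, 7, 9, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: five accuracies, one per held-out fold
    fold_scores = cross_val_score(model, X_synth, y_synth, cv=5)
    mean_scores[k] = fold_scores.mean()

for k, score in mean_scores.items():
    print(f'{k:>2} neighbors ---> {score:.3f}')
```

The printed values describe the synthetic data only and say nothing about the car dataset.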

## 4 - Predictions vs Actual Values

In [112]:
# Doing this, we'll get not just the number (encoded numerical value) but its actual meaning.
# LabelEncoder numbers the classes in sorted order, so the list index must match the encoded value.
names = ["acc", "good", "unacc", "vgood"]

predicted_values = classifier_model.predict(features_X_test)

# Print only the first 5 test rows, not the whole test set
for value in range(min(5, len(features_X_test))):
    print('Predicted value: ', predicted_values[value], '-->', names[predicted_values[value]])
    print('Input Data: ', features_X_test[value])
    print('Actual value', labels_y_test[value], '  -->   ', names[labels_y_test[value]])
    print('-'*50,'\n')

Predicted value:  0 --> acc
Input Data:  (2, 3, 2, 2, 1, 2)
Actual value 0   -->    acc
-------------------------------------------------- 

Predicted value:  2 --> unacc
Input Data:  (2, 2, 1, 0, 0, 1)
Actual value 2   -->    unacc
-------------------------------------------------- 

Predicted value:  2 --> unacc
Input Data:  (0, 2, 3, 0, 2, 1)
Actual value 2   -->    unacc
-------------------------------------------------- 

Predicted value:  2 --> unacc
Input Data:  (3, 2, 2, 0, 1, 1)
Actual value 2   -->    unacc
-------------------------------------------------- 

Predicted value:  2 --> unacc
Input Data:  (3, 2, 3, 0, 0, 0)
Actual value 2   -->    unacc
-------------------------------------------------- 
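
Printing rows one by one only samples a handful of predictions. A confusion matrix summarizes all of them at once: rows are actual classes, columns are predicted classes. The sketch below uses invented label lists (`actual_toy` and `predicted_toy` are hypothetical); in the notebook they would be `labels_y_test` and `predicted_values`. The class names are listed in the sorted order that `LabelEncoder` assigns:

```python
from sklearn.metrics import confusion_matrix

# Class names in LabelEncoder's sorted order: index == encoded value
class_names = ["acc", "good", "unacc", "vgood"]

# Hypothetical encoded labels, invented for illustration
actual_toy = [2, 2, 0, 1, 2, 3, 0]
predicted_toy = [2, 0, 0, 1, 2, 2, 0]

# Entry [i, j] counts samples whose actual class is i and predicted class is j
matrix = confusion_matrix(actual_toy, predicted_toy, labels=[0, 1, 2, 3])

for name, row in zip(class_names, matrix):
    print(f'{name:>6}', row)
```

Counts on the diagonal are correct predictions; everything off the diagonal is a mistake, and its position shows which two classes were confused.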

