The data used in this notebook is taken from https://perso.telecom-paristech.fr/eagan/class/igr204/datasets under a Creative Commons License. The dataset has been cleaned up by Petra Isenberg, Pierre Dragicevic, and Yvonne Jansen.

## Objective

Today, I'll be reading in a dataset of some cars and their features (HP, weight, etc.). Each car also has a place of origin (either Europe, Japan, or the US). 

I'm not a car person: I can't tell whether a car is from the US, Europe, or Japan without knowing the name of the company that made it. This is where machine learning can help. 

My goal is to create a machine learning algorithm that can predict which place a car is from given ONLY its features (HP, weight, etc.).

The algorithm will eventually be presented with a car it has never seen before. It won't know which company made the car or which country it's from. All it knows is the car's MPG, displacement, horsepower, weight, and a few other features. It will use what it knows from the training dataset to predict which place the car is from. 

I will use the k-nearest neighbors (KNN) algorithm to do this. First, we read in our dataset:

In [2]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from collections import Counter

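cars = pd.read_csv('cars.csv', sep=';').drop(labels=0, axis=0)
#drop the row at index 0, which is not a car (in this file it appears to hold the column types)

cars.replace("0", np.nan, inplace=True)
cars.dropna(axis=0, inplace=True)
#remove any cars that have "0" as one of their features (there's no reason why this should be the case)
#np.nan is used instead of np.NaN because the np.NaN alias was removed in NumPy 2.0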

cars.index = range(len(cars))
#we need to reindex the data set since we just dropped several rows

cars

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin
0,Chevrolet Chevelle Malibu,18.0,8,307.0,130.0,3504.,12.0,70,US
1,Buick Skylark 320,15.0,8,350.0,165.0,3693.,11.5,70,US
2,Plymouth Satellite,18.0,8,318.0,150.0,3436.,11.0,70,US
3,AMC Rebel SST,16.0,8,304.0,150.0,3433.,12.0,70,US
4,Ford Torino,17.0,8,302.0,140.0,3449.,10.5,70,US
...,...,...,...,...,...,...,...,...,...
387,Ford Mustang GL,27.0,4,140.0,86.00,2790.,15.6,82,US
388,Volkswagen Pickup,44.0,4,97.00,52.00,2130.,24.6,82,Europe
389,Dodge Rampage,32.0,4,135.0,84.00,2295.,11.6,82,US
390,Ford Ranger,28.0,4,120.0,79.00,2625.,18.6,82,US


Now I'll make my features and labels lists:


In [3]:
features = cars.drop(columns=['Car', 'Origin', 'Model'])

features = features.astype(float)
#the cells are all stored as strings. We want them as floats

features

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration
0,18.0,8.0,307.0,130.0,3504.0,12.0
1,15.0,8.0,350.0,165.0,3693.0,11.5
2,18.0,8.0,318.0,150.0,3436.0,11.0
3,16.0,8.0,304.0,150.0,3433.0,12.0
4,17.0,8.0,302.0,140.0,3449.0,10.5
...,...,...,...,...,...,...
387,27.0,4.0,140.0,86.0,2790.0,15.6
388,44.0,4.0,97.0,52.0,2130.0,24.6
389,32.0,4.0,135.0,84.0,2295.0,11.6
390,28.0,4.0,120.0,79.0,2625.0,18.6



We must scale all of the feature data so that the distance function in the KNN algorithm treats every feature fairly.

Otherwise the algo will factor features like weight much more heavily (see what I did there?) than features like cylinders. The weight is a four-digit number while the number of cylinders is a one-digit number, so a difference in weight between two cars would affect the 'distance' much more than a difference in the number of cylinders. 
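
To make this concrete, here is a small sketch using two rows from the table above (the Chevrolet Chevelle Malibu and the Ford Mustang GL). The variable names are my own and are not used elsewhere in the notebook. It measures how much of the unscaled squared distance comes from each feature.

```python
import numpy as np

# MPG, cylinders, displacement, HP, weight, acceleration (rows 0 and 387 above)
chevelle = np.array([18.0, 8.0, 307.0, 130.0, 3504.0, 12.0])
mustang = np.array([27.0, 4.0, 140.0, 86.0, 2790.0, 15.6])

diff = chevelle - mustang
raw_distance = np.linalg.norm(diff)

# share of the squared distance contributed by each feature
share = diff**2 / np.sum(diff**2)
weight_share = share[4]
cylinder_share = share[1]

print(round(raw_distance, 1))
print(round(weight_share, 3), round(cylinder_share, 6))
```

For this pair, weight alone accounts for more than 90% of the squared distance, while going from 8 cylinders to 4 barely registers.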


In [4]:
scaled_features = preprocessing.scale(features)
#scaled_features is a numpy array

scaled_features_df = pd.DataFrame(scaled_features, columns=['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration'])
scaled_features_df

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration
0,-0.698638,1.483947,1.077290,0.664133,0.620540,-1.285258
1,-1.083498,1.483947,1.488732,1.574594,0.843334,-1.466724
2,-0.698638,1.483947,1.182542,1.184397,0.540382,-1.648189
3,-0.955212,1.483947,1.048584,1.184397,0.536845,-1.285258
4,-0.826925,1.483947,1.029447,0.924265,0.555706,-1.829655
...,...,...,...,...,...,...
387,0.455941,-0.864014,-0.520637,-0.480448,-0.221125,0.021294
388,2.636813,-0.864014,-0.932079,-1.364896,-0.999134,3.287676
389,1.097374,-0.864014,-0.568479,-0.532474,-0.804632,-1.430430
390,0.584228,-0.864014,-0.712005,-0.662540,-0.415627,1.110088


Now that we have our feature values properly cleaned and scaled, let's make our labels:

In [11]:
labels = np.array(cars['Origin'])
labels[:20]
#label_df = pd.DataFrame(cars['Origin'])
#label_df

array(['US', 'US', 'US', 'US', 'US', 'US', 'US', 'US', 'US', 'US', 'US',
       'US', 'US', 'US', 'Japan', 'US', 'US', 'US', 'Japan', 'Europe'],
      dtype=object)

In [5]:
def KNN(new_car, k):
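    #standardize the new car with the dataset's mean and standard deviation for each feature
    #(scaling the new car on its own would compare its six features to each other, not to the other cars)
    new_car = (np.asarray(new_car, dtype=float) - features.mean().to_numpy()) / features.std(ddof=0).to_numpy()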
    distances = []
    
    #we want to take every car and measure its distance from the new car
    #also figure out which place the car is from 
    #for each car, we create a pair with its distance from the new car and its place of origin
    #we append this pair to our distances array
    
    i=0
    for car in scaled_features:
        dist_from_new_car = np.linalg.norm(car - new_car)
        group = labels[i]
        i+=1
        distances.append([dist_from_new_car, group])

    closest_distances = sorted(distances)[:k]
    #sort the distances, then take only the k cars closest to the new car
    
    votes = []
    for i in closest_distances:
        votes.append(i[1])
    
    #we made votes, a list of just k places of origin. 
    #this represents the places of origin for the k closest cars to the new car    
    
    #now each of the k closest cars gets to "cast their vote" for which place the new car is from
    #these votes are stored in, you guessed it, the votes list
    #now we'll return whichever vote is most common, and this will be our guess for the new car's place of origin
    
    result = Counter(votes).most_common(1)[0][0]
    return result



We just made our KNN algorithm. Great! Now we give the algo a brand new car that it has never seen before, and it will predict where that car is from.
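
The same nearest-neighbor vote can also be written without the explicit loop. This is only a sketch: `knn_predict`, `toy_x`, and `toy_y` are hypothetical names with made-up points, and the function assumes the training rows and the new point have already been put on the same scale.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, new_point, k):
    """Return the most common label among the k training points nearest to new_point."""
    train_x = np.asarray(train_x, dtype=float)
    new_point = np.asarray(new_point, dtype=float)
    # Euclidean distance from new_point to every training row at once
    distances = np.linalg.norm(train_x - new_point, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    votes = [train_y[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]

# toy, made-up points: two clusters in two dimensions
toy_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
toy_y = ["A", "A", "A", "B", "B", "B"]

toy_prediction = knn_predict(toy_x, toy_y, [4.8, 5.0], k=3)
print(toy_prediction)
```

The new point sits inside the second cluster, so all three of its nearest neighbors vote "B".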

In [6]:
def predict(new_car):
    prediction = KNN(new_car, 5)
    print(prediction)

You can try this for yourself by replacing new_car with the features of whatever classic car you like!

In [15]:
#values in the list represent MPG, cylinders, displacement, HP, weight, and acceleration respectively

new_car = [20.6, 6, 200, 90, 3000, 10]
predict(new_car)

US


Now let's see how accurate the KNN algorithm provided by scikit-learn is on this data, as a benchmark for our own:

In [12]:
from sklearn import model_selection, neighbors

x = scaled_features
y = np.array(cars['Origin'])

x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(x_train, y_train)

accuracy = clf.score(x_test, y_test)

print(accuracy)

0.7974683544303798
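
Because `train_test_split` is called without a `random_state`, this accuracy changes from run to run. One way to get a steadier estimate is cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the cars (so its scores say nothing about the real dataset), and `demo_x`, `demo_y`, and `model` are names I made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the car data: 392 rows, 6 features, 3 classes
demo_x, demo_y = make_classification(
    n_samples=392, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# the pipeline refits the scaler on each training fold, so each test fold stays unseen
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, demo_x, demo_y, cv=5)

print(scores.mean(), scores.std())
```

The mean over five folds is less sensitive to a lucky or unlucky split than a single score, and the standard deviation gives a rough idea of how much one split can move the number.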


## To-do list

Right now the dataset has way too many American cars.
This means an American car is more likely to get a vote than a European or Japanese car in the KNN algorithm.
This can hurt the KNN algo's accuracy on European and Japanese cars, since they are easily outvoted.
Make sure the numbers of American, Japanese, and European cars are equal.

In [13]:
europe = cars[cars['Origin']=="Europe"]
japan = cars[cars['Origin']=="Japan"]
us = cars[cars['Origin']=="US"]

num_eu_cars = len(europe)
num_jp_cars = len(japan)
num_us_cars = len(us)

print("This dataset has", num_eu_cars, "European cars,", num_jp_cars, "Japanese cars, and", num_us_cars, "American cars.")

This dataset has 68 European cars, 79 Japanese cars, and 245 American cars.


Let's randomly drop American and Japanese cars from the dataset until each place has the same number of cars (68, the number of European cars).

In [14]:
cars = pd.concat([japan.sample(num_eu_cars), us.sample(num_eu_cars), europe])
#shuffle the rows, then reindex from 0
cars = cars.sample(frac=1).reset_index(drop=True)
cars

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin
0,Dodge Charger 2.2,36.0,4,135.0,84.00,2370.,13.0,82,US
1,Saab 99LE,25.0,4,121.0,115.0,2671.,13.5,75,Europe
2,Mazda RX2 Coupe,19.0,3,70.00,97.00,2330.,13.5,72,Japan
3,Ford Fairmont (man),25.1,4,140.0,88.00,2720.,15.4,78,US
4,Volkswagen Rabbit Custom,31.9,4,89.00,71.00,1925.,14.0,79,Europe
...,...,...,...,...,...,...,...,...,...
199,Peugeot 604sl,16.2,6,163.0,133.0,3410.,15.8,78,Europe
200,Saab 99e,25.0,4,104.0,95.00,2375.,17.5,70,Europe
201,Datsun 710,24.0,4,119.0,97.00,2545.,17.0,75,Japan
202,Volkswagen Dasher,30.5,4,97.00,78.00,2190.,14.1,77,Europe
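
The downsampling step can also be written so that the group size is not hard-coded and the result is repeatable. This is a sketch on a made-up miniature table: `toy_cars`, `smallest_group`, and `balanced` are hypothetical names, and fixing `random_state` is my own choice rather than something the notebook does.

```python
import pandas as pd

# made-up miniature of the dataset, with the same kind of imbalance
toy_cars = pd.DataFrame({
    "Origin": ["US"] * 6 + ["Japan"] * 3 + ["Europe"] * 2,
    "Weight": [3500, 3400, 3600, 3300, 3700, 3200, 2100, 2200, 2000, 2500, 2600],
})

# size of the rarest class decides how many rows every class keeps
smallest_group = toy_cars["Origin"].value_counts().min()

balanced = (
    toy_cars.groupby("Origin")
    .sample(n=smallest_group, random_state=0)  # same number of rows per origin
    .sample(frac=1, random_state=0)            # shuffle the rows
    .reset_index(drop=True)
)

print(balanced["Origin"].value_counts())
```

Every origin ends up with as many rows as the rarest one, which is the same idea as sampling 68 Japanese and 68 American cars above.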


Now that we have an equal number of cars from each place, let's run scikit-learn's KNN classifier again. This time I also leave out Displacement:

In [22]:

features = cars.drop(columns=['Car', 'Model', 'Origin', 'Displacement']).astype(float)
a = preprocessing.scale(features)
b = np.array(cars['Origin'])

a_train, a_test, b_train, b_test = model_selection.train_test_split(a, b, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(a_train, b_train)

accuracy = clf.score(a_test, b_test)

print(accuracy)

0.8292682926829268


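This algorithm works nicely: accuracy went from about 0.80 to about 0.83, although each number comes from a single random train/test split, so part of that gap may be chance. I know there's even more room for improvement. As I learn more about optimizing data for machine learning methods, I'll come back and update this notebook. Thank you for giving this a read. Feel free to let me know if you have suggestions!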