# Identifying Cars from Unique Locations

**Project goal: to demonstrate an understanding of Multi-Class Classification.**

We are going to use a dataset with a wealth of information about cars to predict each vehicle's origin: `North America`, `Europe`, or `Asia`. The dataset is hosted by the University of California, Irvine in its Machine Learning Repository. You can [download the data here](https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data). It was taken from the StatLib library, which is maintained at Carnegie Mellon University, and was used in the 1983 American Statistical Association Exposition. The repository entry is dated July 7, 1993.

Attribute Information:

    1. mpg:           continuous
    2. cylinders:     multi-valued discrete (integer, ordinal, and categorical)
    3. displacement:  continuous
    4. horsepower:    continuous
    5. weight:        continuous
    6. acceleration:  continuous
    7. model year:    multi-valued discrete (integer, categorical)
    8. origin:        multi-valued discrete (1 = North America, 2 = Europe, 3 = Asia)
    9. car name:      string (unique for each instance)

In [1]:
import pandas as pd
import numpy as np
np.random.seed(42)

# read in the data file
cars = pd.read_csv('auto-mpg.data', 
                   sep=r'\s+',  # raw string: whitespace-delimited file
                   names=['mpg','cylinders','displacement','horsepower','weight',
                          'acceleration','model_year','origin','car_name']
                  )
cars.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


In [2]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car_name      398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


## Dummy Variables

We must use **dummy variables** to represent any categorical column that has more than two categories. For example, `cylinders` takes the values *3, 4, 5, 6,* and *8*, so we split it into five binary columns, one per value, where a 1 marks the category a row belongs to.

We set the `prefix` parameter to `cyl` so that pandas prepends `cyl_` to each new column name.

Next, the `pandas.concat()` function will add the dummy dataframe back to cars.
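To see what these two steps produce before applying them to the full dataset, here is a minimal sketch on a toy column. The values and the name `toy` are illustrative assumptions, not part of the cars data. The `.astype(int)` call is only there so the result shows 0/1 regardless of pandas version.

```python
import pandas as pd

# Toy column standing in for cars['cylinders'] (illustrative values only)
toy = pd.DataFrame({"cylinders": [4, 8, 6, 4]})

# One binary column per distinct value, named cyl_<value>
toy_dummies = pd.get_dummies(toy["cylinders"], prefix="cyl").astype(int)

# Attach the new columns and drop the original categorical column
toy = pd.concat([toy, toy_dummies], axis=1).drop("cylinders", axis=1)
print(toy)
```

Each row ends up with exactly one 1, in the column that matches its original value.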

In [3]:
# get dummies for the cylinders column
dummy_cyl = pd.get_dummies(cars['cylinders'], prefix='cyl')
cars = pd.concat([cars, dummy_cyl], axis=1)

# get dummies for the model_year column
dummy_years = pd.get_dummies(cars['model_year'], prefix='year')
cars = pd.concat([cars, dummy_years], axis=1)

# drop the original categorical columns
cars = cars.drop(['cylinders','model_year'], axis=1)
print("DataFrame Shape: ", cars.shape)
cars.head()

DataFrame Shape:  (398, 25)


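|   | mpg | displacement | horsepower | weight | acceleration | origin | car_name | cyl_3 | cyl_4 | cyl_5 | ... | year_73 | year_74 | year_75 | year_76 | year_77 | year_78 | year_79 | year_80 | year_81 | year_82 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 307.0 | 130.0 | 3504.0 | 12.0 | 1 | chevrolet chevelle malibu | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 15.0 | 350.0 | 165.0 | 3693.0 | 11.5 | 1 | buick skylark 320 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 18.0 | 318.0 | 150.0 | 3436.0 | 11.0 | 1 | plymouth satellite | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 16.0 | 304.0 | 150.0 | 3433.0 | 12.0 | 1 | amc rebel sst | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 17.0 | 302.0 | 140.0 | 3449.0 | 10.5 | 1 | ford torino | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |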


## Multi-Class Classification

There is more than one way to do multi-class classification; we'll focus on the **one-vs-all** method, which turns our problem into several binary classification problems.

We start by splitting our data into `train` and `test` sets. We first shuffle the rows of the `cars` dataframe so that the split does not depend on the file's ordering; 70% of the rows then go into `train`, and the remaining 30% are held out in `test` to verify the models.
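As an alternative to slicing by hand, scikit-learn's `train_test_split` shuffles and splits in one call. The sketch below uses a stand-in frame named `demo` (an assumption for illustration) with the same number of rows as the cars data. The resulting sizes may differ from a manual `round()`-based slice by one row because of how the split size is rounded.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with 398 rows, matching the row count reported by cars.info()
demo = pd.DataFrame({"origin": np.tile([1, 2, 3], 133)[:398]})

# Shuffle, then keep 70% for training and the rest for testing
demo_train, demo_test = train_test_split(
    demo, train_size=0.7, shuffle=True, random_state=42
)
print(len(demo_train), len(demo_test))
```

Passing `stratify=demo["origin"]` would additionally keep the class proportions similar in both sets, which can matter when one origin dominates.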

Overall, we are changing an n-class classification problem into `n` binary classification problems. With three origins, we need to train three models:
 - one where cars built in North America are positive and all others are negative,
 - one where cars built in Europe are positive and all others are negative,
 - one where cars built in Asia are positive and all others are negative.

The dummy variables that we created from the `cylinders` and `model_year` columns will be the features for the `LogisticRegression` class from scikit-learn.
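For comparison, scikit-learn also ships a wrapper, `OneVsRestClassifier`, that applies the same strategy. The sketch below runs on synthetic binary features (an assumption, standing in for the dummy columns). It fits the three binary models by hand and checks that the wrapper makes the same predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 150 rows of binary features, classes labelled 1, 2, 3
X = rng.integers(0, 2, size=(150, 6))
y = rng.integers(1, 4, size=150)

# Manual one-vs-all: one binary model per class, keep P(positive) from each
manual_probs = np.column_stack([
    LogisticRegression().fit(X, y == k).predict_proba(X)[:, 1]
    for k in (1, 2, 3)
])
manual_pred = np.array([1, 2, 3])[manual_probs.argmax(axis=1)]

# The same strategy through scikit-learn's wrapper
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
ovr_pred = ovr.predict(X)

# Fraction of rows where the two approaches agree
agreement = (manual_pred == ovr_pred).mean()
print(agreement)
```

Writing the loop by hand, as this notebook does, keeps each binary model visible; the wrapper is shorter when only the final prediction is needed.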

In [4]:
# shuffle the rows so that the split does not depend on file order
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.loc[shuffled_rows]
# first 70% of the shuffled rows for training, the rest for testing
split = round(len(shuffled_cars) * 0.7)
train = shuffled_cars[:split]
test = shuffled_cars[split:]

In [5]:
from sklearn.linear_model import LogisticRegression

# define the target key values representing the different origins
unique_origins = cars["origin"].unique()
unique_origins.sort()

# create a dictionary to add each model to
models = {}

# list comprehension to generate features for regression
features = [c for c in train.columns if c.startswith('cyl') or c.startswith('year')]

# train 3 models for our origins
for i in unique_origins:
    model = LogisticRegression()
    X_train = train[features]
    y_train = train['origin'] == i
    
    model.fit(X_train, y_train)
    models[i] = model

print(models)

{1: LogisticRegression(), 2: LogisticRegression(), 3: LogisticRegression()}


## Testing

Now that we have a model for each category, we can run our test dataset through the models and evaluate how well they perform.

In [6]:
# create a df to contain the predicted probabilities
testing_probs = pd.DataFrame(columns=unique_origins)

# add to the 'testing_probs' dataframe
for i in unique_origins:
    # select testing features
    X_test = test[features]
    # compute probability of observation being in the origin
    testing_probs[i] = models[i].predict_proba(X_test)[:,1]

testing_probs.head()

|   | 1 | 2 | 3 |
|---|---|---|---|
| 0 | 0.963018 | 0.032446 | 0.018271 |
| 1 | 0.843672 | 0.080925 | 0.078139 |
| 2 | 0.272697 | 0.205069 | 0.518609 |
| 3 | 0.792175 | 0.112188 | 0.076417 |
| 4 | 0.884702 | 0.080899 | 0.051312 |


Now that we have a dataframe in which each column represents an origin, we need to choose, for each row, the column with the *highest probability*. Since each column maps directly to an origin, **the resulting series will be the classification from our model.**
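A small sketch of that selection step, using made-up probabilities (the numbers and the name `toy_probs` are illustrative assumptions):

```python
import pandas as pd

# Illustrative probabilities for three observations; columns are origin codes
toy_probs = pd.DataFrame({1: [0.90, 0.20, 0.30],
                          2: [0.06, 0.30, 0.45],
                          3: [0.04, 0.50, 0.25]})

# For each row, return the label of the column holding the largest value
toy_pred = toy_probs.idxmax(axis=1)
print(toy_pred.tolist())  # [1, 3, 2]
```

Because the three models are fit independently, the probabilities in a row need not sum to 1; only their ranking matters for the final prediction.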

In [7]:
predicted_origins = testing_probs.idxmax(axis=1)
print(predicted_origins)

0      1
1      1
2      3
3      1
4      1
      ..
114    1
115    1
116    2
117    3
118    2
Length: 119, dtype: int64


## Accuracy

Companies use ML models to make practical business decisions, and more accurate predictions lead to better decisions. Errors can be costly, so before relying on the model we measure its accuracy: the fraction of test observations whose predicted origin matches the actual origin.

In [8]:
# compare our predictions to the original regions
result_df = pd.DataFrame(columns=['actual', 'predicted'])
result_df['actual'] = test.origin
result_df['predicted'] = np.array(predicted_origins)

from sklearn.metrics import accuracy_score

accuracy_score(test.origin, predicted_origins)

0.680672268907563
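A score like this is the share of correct predictions, and it is easier to interpret next to a simple baseline. The sketch below computes accuracy by hand and compares it with always predicting the most frequent class. The label arrays are made up for illustration and are not taken from the test set above.

```python
import numpy as np

# Illustrative labels only; not the notebook's actual test data
actual = np.array([1, 1, 3, 1, 2, 1, 3, 2, 1, 1])
predicted = np.array([1, 1, 3, 1, 1, 1, 2, 2, 1, 3])

# Accuracy = number of correct predictions / number of predictions
accuracy = (actual == predicted).mean()

# Baseline: always predict the most frequent class found in `actual`
values, counts = np.unique(actual, return_counts=True)
baseline = counts.max() / counts.sum()

print(accuracy, baseline)  # 0.7 0.6
```

If a model's accuracy is close to this baseline, it has learned little beyond the class frequencies.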

Now that we know how accurate our model is, we should look at some other models to try to improve accuracy. We are going to train several types of models (Random Forest, K-Nearest Neighbors Classifier, and Support Vector Classifier) and check the prediction accuracy using `accuracy_score` from scikit-learn's `metrics` module.

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# store our train and test data in variables for predictive analysis
X_train = train[features]
y_train = train['origin']
X_test = test[features]
y_test = test['origin']

# train a Random Forest model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print('Random Forest ------- accuracy score:', accuracy_score(y_test, y_pred))

# train a KNN model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print('K-Nearest Neighbors - accuracy score:', accuracy_score(y_test, y_pred))

# train a SVM classifier model
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print('Support Vector ------ accuracy score:', accuracy_score(y_test, y_pred))

Random Forest ------- accuracy score: 0.6974789915966386
K-Nearest Neighbors - accuracy score: 0.6554621848739496
Support Vector ------ accuracy score: 0.6638655462184874


## Conclusion

Overall, our one-vs-all classifier did not predict as accurately as I had hoped (about 68% accuracy). It scored slightly above the K-Nearest Neighbors and Support Vector classifiers but slightly below the Random Forest. To improve accuracy, we could spend time on further feature engineering to give the models more informative inputs. We could also include more of the existing columns, such as `mpg`, `displacement`, and `weight`, when fitting; so far only the `cylinders` and `model_year` dummies are used.
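One practical detail when bringing in more columns: `cars.info()` reports `horsepower` as `object` rather than a numeric type, which suggests it contains non-numeric entries. The sketch below assumes those entries are `?` placeholders and fills them with the column median. Both the placeholder and the fill strategy are assumptions, and `toy_cars` is a hypothetical stand-in frame.

```python
import pandas as pd

# Small stand-in for the cars frame; '?' mimics a non-numeric placeholder
toy_cars = pd.DataFrame({
    "horsepower": ["130.0", "165.0", "?", "150.0"],
    "weight": [3504.0, 3693.0, 2046.0, 3433.0],
})

# Turn unparseable entries into NaN, then fill them with the column median
toy_cars["horsepower"] = pd.to_numeric(toy_cars["horsepower"], errors="coerce")
toy_cars["horsepower"] = toy_cars["horsepower"].fillna(toy_cars["horsepower"].median())

# Numeric columns that could be appended to the existing feature list
extra_features = ["horsepower", "weight"]
print(toy_cars[extra_features].dtypes)
```

Dropping the affected rows instead of filling them would be an equally reasonable choice for a dataset of this size.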

We explored a transformation-to-binary technique with our one-vs-all method. Another route would be *hierarchical classification*, where the output space is divided into a tree: each parent node is split into multiple child nodes, and the process continues until each child node represents only one class. Decision Trees and Random Forests use a related recursive-splitting idea, although they split on feature values rather than on groups of classes.
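A rough sketch of what a two-level hierarchy could look like for three origins is shown below, on synthetic data. The choice to separate North America first and then split Europe from Asia is an assumption made for illustration, as are the names `root`, `leaf`, and `predict_hierarchical`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 rows, 4 numeric features, classes 1, 2, 3
y = rng.integers(1, 4, size=300)
X = rng.normal(size=(300, 4)) + y[:, None]  # shift features by class to add signal

# Level 1: North America (1) versus everything else
root = LogisticRegression().fit(X, y == 1)

# Level 2: among the remaining rows, Europe (2) versus Asia (3)
rest = y != 1
leaf = LogisticRegression().fit(X[rest], y[rest])

def predict_hierarchical(X_new):
    """Send each row through the root model, then through the leaf model if needed."""
    is_north_america = root.predict(X_new)
    return np.where(is_north_america, 1, leaf.predict(X_new))

hier_pred = predict_hierarchical(X)
print((hier_pred == y).mean())
```

With only three classes the tree is shallow; the approach is more useful when classes fall into natural groups.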