In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

cars = pd.read_csv('auto-mpg.csv')
print(cars.head())

     18  8    307  130  3504    12  70  1 chevrolet chevelle malibu
0  15.0  8  350.0  165  3693  11.5  70  1         buick skylark 320
1  18.0  8  318.0  150  3436  11.0  70  1        plymouth satellite
2  16.0  8  304.0  150  3433  12.0  70  1             amc rebel sst
3  17.0  8  302.0  140  3449  10.5  70  1               ford torino
4  15.0  8  429.0  198  4341  10.0  70  1          ford galaxie 500


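The dataset is now imported, but note that the file has no header row, so pandas has used the first record (the chevrolet chevelle malibu) as the column labels. That record is therefore missing from the data rows; passing header=None to read_csv would keep it. Files without headers are common when data distributed as a .txt file is converted to .csv. The next step is to add the column names manually.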
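One way to keep the first record is to tell pandas that the file has no header row. The following is a minimal sketch, not the notebook's own code: it assumes the file is comma-separated, and it uses a two-row inline sample (copied from the output above) in place of auto-mpg.csv. The names `sample`, `column_names`, and `sample_cars` are illustrative.

```python
import io
import pandas as pd

# Two records copied from the output above stand in for auto-mpg.csv.
# Assumption: the real file is comma-separated like this sample.
sample = io.StringIO(
    "18,8,307,130,3504,12,70,1,chevrolet chevelle malibu\n"
    "15,8,350,165,3693,11.5,70,1,buick skylark 320\n"
)

column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                'acceleration', 'year', 'origin', 'car_name']

# header=None stops pandas from consuming the first record as column labels;
# names supplies the labels instead.
sample_cars = pd.read_csv(sample, header=None, names=column_names)
print(sample_cars.shape)
print(sample_cars['car_name'].tolist())
```

With the real file, the same two arguments would replace the separate `cars.columns = [...]` assignment.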

In [2]:
cars.columns = ['mpg','cylinders','displacement','horsepower','weight','acceleration','year','origin','car_name']
print(cars.head())

    mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0  15.0          8         350.0        165    3693          11.5    70   
1  18.0          8         318.0        150    3436          11.0    70   
2  16.0          8         304.0        150    3433          12.0    70   
3  17.0          8         302.0        140    3449          10.5    70   
4  15.0          8         429.0        198    4341          10.0    70   

   origin            car_name  
0       1   buick skylark 320  
1       1  plymouth satellite  
2       1       amc rebel sst  
3       1         ford torino  
4       1    ford galaxie 500  


In [3]:
unique_regions = cars['origin'].unique()
print(unique_regions)

[1 3 2]


Our aim here is to classify the "origin" of each vehicle from the other given parameters.
As you can see, there are 3 unique regions:
1 --> North America
2 --> Europe
3 --> Asia
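If readable labels are needed later (for example in printed results), the codes can be mapped to region names. This is a small sketch using a hypothetical sample of codes rather than the real cars['origin'] column; `origin_names` and `codes` are illustrative names.

```python
import pandas as pd

# Mapping taken from the list above.
origin_names = {1: 'North America', 2: 'Europe', 3: 'Asia'}

# Hypothetical sample of origin codes standing in for cars['origin'].
codes = pd.Series([1, 3, 2, 1])

# Series.map looks each code up in the dictionary.
named = codes.map(origin_names)
print(named.tolist())
```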

In [4]:
dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
cars = pd.concat([cars, dummy_cylinders], axis=1)
print(cars.head())

dummy_years = pd.get_dummies(cars['year'], prefix='year')
cars = pd.concat([cars, dummy_years], axis=1)
cars.drop(['year','cylinders'],inplace=True,axis=1)
print(cars.head())

    mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0  15.0          8         350.0        165    3693          11.5    70   
1  18.0          8         318.0        150    3436          11.0    70   
2  16.0          8         304.0        150    3433          12.0    70   
3  17.0          8         302.0        140    3449          10.5    70   
4  15.0          8         429.0        198    4341          10.0    70   

   origin            car_name  cyl_3  cyl_4  cyl_5  cyl_6  cyl_8  
0       1   buick skylark 320    0.0    0.0    0.0    0.0    1.0  
1       1  plymouth satellite    0.0    0.0    0.0    0.0    1.0  
2       1       amc rebel sst    0.0    0.0    0.0    0.0    1.0  
3       1         ford torino    0.0    0.0    0.0    0.0    1.0  
4       1    ford galaxie 500    0.0    0.0    0.0    0.0    1.0  
    mpg  displacement horsepower  weight  acceleration  origin  \
0  15.0         350.0        165    3693          11.5       1   
1  18.0        

We have to create numeric representations of categorical values ourselves. In this dataset, categorical variables appear in three columns: cylinders, year, and origin. The cylinders and year columns must be converted to numeric features so we can use them to predict the label, origin. Even though year is stored as a number, we're going to treat it as a category: the years 70 and 71 are better thought of as two different labels than as two points one unit apart on a numeric scale. For discrete values like these, treating them as categorical variables is usually the safer choice.

We use dummy variables for columns containing categorical values. When a column has more than 2 categories, we create one binary column per category. The cylinders column takes 5 distinct values (3, 4, 5, 6, and 8), so we split it into 5 separate binary columns:

cyl_3 -- Does the car have 3 cylinders? 0 if False, 1 if True.
cyl_4 -- Does the car have 4 cylinders? 0 if False, 1 if True.
cyl_5 -- Does the car have 5 cylinders? 0 if False, 1 if True.
cyl_6 -- Does the car have 6 cylinders? 0 if False, 1 if True.
cyl_8 -- Does the car have 8 cylinders? 0 if False, 1 if True.
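The effect of the split is easiest to see on a tiny example. The sketch below uses a hypothetical six-row column, not the real data. The `dtype=int` argument is included because recent pandas versions return booleans by default, whereas the output above shows 0/1 values.

```python
import pandas as pd

# Hypothetical toy column containing the five cylinder counts from the dataset.
toy = pd.DataFrame({'cylinders': [8, 4, 6, 3, 5, 4]})

# One binary column per distinct value; dtype=int yields 0/1 instead of booleans.
toy_dummies = pd.get_dummies(toy['cylinders'], prefix='cyl', dtype=int)
print(toy_dummies)
```

Each row has exactly one 1, in the column matching that car's cylinder count.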

In [5]:
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_rows]

highest_train_row = int(cars.shape[0] * .70)
train = shuffled_cars.iloc[0:highest_train_row]
test = shuffled_cars.iloc[highest_train_row:]

We shuffle the rows, then keep 70% of the data for training and 30% for testing. Next, we train a multiclass logistic regression model.
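Because the permutation above is unseeded, the split (and every number computed from it) changes from run to run. A minimal sketch of a reproducible variant follows, using a hypothetical ten-row frame in place of cars. The seed value is arbitrary, and `toy_cars`, `toy_train`, and `toy_test` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row frame standing in for cars.
toy_cars = pd.DataFrame({'origin': [1, 1, 2, 3, 1, 2, 1, 3, 1, 1]})

# A seeded generator makes the shuffle repeatable; the seed value is arbitrary.
rng = np.random.default_rng(seed=1)
shuffled = toy_cars.iloc[rng.permutation(len(toy_cars))]

# First 70% of the shuffled rows for training, the remainder for testing.
cutoff = int(len(shuffled) * 0.70)
toy_train = shuffled.iloc[:cutoff]
toy_test = shuffled.iloc[cutoff:]
print(len(toy_train), len(toy_test))
```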

In [8]:
from sklearn.linear_model import LogisticRegression

unique_regions.sort()
models = {}
features = [c for c in train.columns if c.startswith('cyl') or c.startswith('year')]

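# Train one binary classifier per origin (one-vs-all).
for origin in unique_regions:
    model = LogisticRegression()
    X_train = train[features]
    # True for rows from this origin, False for all others.
    y_train = train['origin'] == origin

    model.fit(X_train, y_train)
    models[origin] = model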

In the one-vs-all approach, we convert an n-class classification problem (here n is 3) into n binary classification problems. In our case, we train 3 models:

1. A model where all cars built in North America are considered Positive (1) and those built in Europe and Asia are considered Negative (0).
2. A model where all cars built in Europe are considered Positive (1) and those built in North America and Asia are considered Negative (0).
3. A model where all cars built in Asia are considered Positive (1) and those built in North America and Europe are considered Negative (0).

Each of these models is a binary classifier that returns a probability between 0 and 1. When we apply the models to new data, each one returns a probability value (3 in total). For each observation, we choose the label corresponding to the model that predicted the highest probability.
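The whole procedure can be sketched end to end on toy data. The example below is an illustration under stated assumptions, not the notebook's code: the nine 2-D points are hypothetical and deliberately well separated, and `binary_models`, `probs`, and `predictions` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, well-separated 2-D toy data: three clusters, one per class.
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [5.0, 0.0], [5.5, 0.0], [5.0, 0.5],
              [0.0, 5.0], [0.5, 5.0], [0.0, 5.5]])
y = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
classes = np.unique(y)

# One binary model per class: that class is positive, all others negative.
binary_models = {c: LogisticRegression().fit(X, y == c) for c in classes}

# Column j holds the probability of the positive label from the model for classes[j].
probs = np.column_stack(
    [binary_models[c].predict_proba(X)[:, 1] for c in classes]
)

# Pick, for each row, the class whose model reported the highest probability.
predictions = classes[probs.argmax(axis=1)]
print(predictions)
```

The three probabilities in a row come from independent models, so they need not sum to 1; only their ranking matters for the final label.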

In [9]:
testing_probs = pd.DataFrame(columns = unique_regions)

for origin in unique_regions:
    testing_probs[origin] = models[origin].predict_proba(test[features])[:,1]

Now that we have a model for each category, we can run the test dataset through the models and evaluate how well they perform. The cell above computes, for every test observation, the probability reported by each of the three models.

To classify an observation, we select the origin whose model gave the highest probability. Each column of the testing_probs dataframe represents one origin, so we just need to pick, for each row, the column with the largest value.
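As a small illustration of that selection step, the sketch below applies `idxmax` to a hypothetical three-row table of probabilities. The values and row labels are made up for the example.

```python
import pandas as pd

# Hypothetical probabilities for three test rows; column labels are origin codes.
toy_probs = pd.DataFrame(
    {1: [0.80, 0.10, 0.30], 2: [0.15, 0.70, 0.20], 3: [0.05, 0.20, 0.50]},
    index=[42, 7, 19],  # arbitrary row labels, as they might look after a shuffle
)

# idxmax(axis=1) returns, for each row, the column label holding the largest value.
toy_predicted = toy_probs.idxmax(axis=1)
print(toy_predicted)
```

The result keeps the row labels of the probability table, which matters when it is later combined with another dataframe.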

In [10]:
predicted_origins = testing_probs.idxmax(axis = 1)
print(predicted_origins)

0      1
1      1
2      1
3      1
4      2
5      1
6      2
7      1
8      2
9      1
10     1
11     1
12     1
13     1
14     1
15     1
16     1
17     1
18     1
19     1
20     1
21     1
22     3
23     2
24     1
25     1
26     3
27     1
28     3
29     2
      ..
90     1
91     3
92     1
93     1
94     1
95     1
96     1
97     1
98     3
99     2
100    1
101    1
102    1
103    3
104    1
105    1
106    1
107    1
108    1
109    3
110    1
111    1
112    2
113    1
114    1
115    2
116    1
117    1
118    1
119    3
dtype: int64


Let's work out the accuracy of the model, that is, the fraction of test rows whose predicted origin matches the true origin.

In [11]:
test['predicted_origins'] = predicted_origins
correct_predictions = test[test['origin'] == test['predicted_origins']]
accuracy = len(correct_predictions) / len(test)
print(accuracy)

0.175


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
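The accuracy printed above is probably understated. testing_probs was created without an index, so predicted_origins is labeled 0 to 119, while test keeps the shuffled row labels from cars. Assigning a Series to a DataFrame column aligns on index labels, so test rows whose label has no match receive NaN and are counted as wrong. The sketch below reproduces the effect and shows one possible fix, using a hypothetical four-row frame rather than the real data.

```python
import pandas as pd

# Hypothetical test set whose index labels come from a shuffle.
toy_test = pd.DataFrame({'origin': [1, 2, 3, 1]}, index=[10, 3, 0, 7])

# Predictions in the same row order as toy_test, but labeled 0..3.
toy_predicted = pd.Series([1, 2, 3, 1])

# Column assignment aligns on labels: only labels 3 and 0 exist in both,
# so the rows labeled 10 and 7 receive NaN.
misaligned = toy_test.copy()
misaligned['predicted_origins'] = toy_predicted
print(misaligned['predicted_origins'].isna().sum())

# Assigning the raw values keeps the positional order instead.
aligned = toy_test.copy()
aligned['predicted_origins'] = toy_predicted.to_numpy()
accuracy = (aligned['origin'] == aligned['predicted_origins']).mean()
print(accuracy)
```

In the notebook, the equivalent change would be to build testing_probs with index=test.index, or to assign predicted_origins.values. Creating test with .copy() should also avoid the warning shown above.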
