## Introduction to the data

* mpg -- Miles per gallon, Continuous.
* cylinders -- Number of cylinders in the motor, Integer, Ordinal, and Categorical.
* displacement -- Size of the motor, Continuous.
* horsepower -- Horsepower produced, Continuous.
* weight -- Weight of the car, Continuous.
* acceleration -- Acceleration, Continuous.
* year -- Year the car was built, Integer and Categorical.
* origin -- Integer and Categorical. 1: North America, 2: Europe, 3: Asia.

In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [27]:
cars = pd.read_csv('auto.csv')
unique_origins = cars['origin'].unique()
print(unique_origins)

[1 3 2]


In [5]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    float64
 5   acceleration  392 non-null    float64
 6   year          392 non-null    int64  
 7   origin        392 non-null    int64  
dtypes: float64(5), int64(3)
memory usage: 24.6 KB


In [11]:
print(cars.year.unique())
print(cars.cylinders.unique())

[70 71 72 73 74 75 76 77 78 79 80 81 82]
[8 4 6 3 5]


## Dummy variables

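In this dataset, three columns hold categorical values: `cylinders`, `year`, and `origin`. The `cylinders` and `year` columns are stored as integers, but we treat their values as categories rather than quantities, so we convert them to dummy (indicator) columns before using them as features to predict the label, `origin`.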
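To make the conversion concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are hypothetical and are not rows from `auto.csv`; only `pd.get_dummies` and the `cyl` prefix come from the cells in this section.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not rows from auto.csv.
toy = pd.DataFrame({"cylinders": [8, 4, 6, 4]})

# One indicator column per distinct value, named with the given prefix.
dummies = pd.get_dummies(toy["cylinders"], prefix="cyl")
print(dummies.columns.tolist())
```

Each row has exactly one indicator switched on, the one matching its original value. Depending on the pandas version, the indicators print as 0/1 or as True/False.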

In [15]:
dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
cars = pd.concat([cars, dummy_cylinders], axis=1)

In [16]:
dummy_years = pd.get_dummies(cars['year'], prefix='year')
cars = pd.concat([cars, dummy_years], axis=1)

In [17]:
cars = cars.drop(['year', 'cylinders'], axis=1)
cars.head()

,mpg,displacement,horsepower,weight,acceleration,origin,cyl_3,cyl_4,cyl_5,cyl_6,...,year_73,year_74,year_75,year_76,year_77,year_78,year_79,year_80,year_81,year_82
0,18.0,307.0,130.0,3504.0,12.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,15.0,350.0,165.0,3693.0,11.5,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,18.0,318.0,150.0,3436.0,11.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,16.0,304.0,150.0,3433.0,12.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,17.0,302.0,140.0,3449.0,10.5,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Multiclass classification

When we have three or more categories, we call the problem a **multiclass classification** problem. There are a few different methods of doing multiclass classification; in this notebook, we'll focus on the **one-versus-all** method.

In [21]:
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_rows]

In [23]:
train = shuffled_cars[:int(0.7*len(cars))]
test = shuffled_cars[int(0.7*len(cars)):]

## Training a multiclass logistic regression model

In the one-vs-all approach, we're essentially converting an n-class (in our case n is 3) classification problem into n binary classification problems. For our case, we'll need to train 3 models:

* A model where all cars built in North America are considered Positive (1) and those built in Europe and Asia are considered Negative (0).
* A model where all cars built in Europe are considered Positive (1) and those built in North America and Asia are considered Negative (0).
* A model where all cars built in Asia are labeled Positive (1) and those built in North America and Europe are considered Negative (0).
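The relabeling described in the list above can be sketched on a handful of hypothetical labels. The array `origins` and the dictionary `binary_targets` are illustrative names, not part of the notebook's own code.

```python
import numpy as np

# Hypothetical origin labels (1: North America, 2: Europe, 3: Asia).
origins = np.array([1, 3, 2, 1, 3])

# One binary target per class: True where the row belongs to that class.
binary_targets = {int(label): origins == label for label in np.unique(origins)}

for label, target in binary_targets.items():
    print(label, target.astype(int))
```

Every row is Positive in exactly one of the three targets and Negative in the other two, which is what lets us train three independent binary models on the same features.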

In [29]:
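# Sort the labels in place so the models and probability columns are ordered 1, 2, 3.
unique_origins.sort()

models = {}
# Columns from position 6 onward are the cyl_* and year_* dummy columns.
x_train = train.iloc[:,6:]

for origin in unique_origins:
    # Binary target: True for cars from this origin, False for all others.
    y_train = (train['origin'] == origin)
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    models[origin] = lr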

## Testing the models

In [32]:
x_test = test.iloc[:,6:]
testing_probs = pd.DataFrame(columns=unique_origins)

for origin in unique_origins:
    # Column 1 of predict_proba is the probability of the positive class (this origin).
    testing_probs[origin] = models[origin].predict_proba(x_test)[:,1]

In [33]:
testing_probs

,1,2,3
0,0.966266,0.018239,0.031172
1,0.871341,0.092257,0.043763
2,0.984842,0.021979,0.007914
3,0.968466,0.019733,0.026522
4,0.947519,0.048176,0.015997
...,...,...,...
113,0.949125,0.024198,0.037165
114,0.970519,0.016778,0.029259
115,0.260136,0.531928,0.200882
116,0.358078,0.294339,0.332230


## Choose the origin

In [38]:
predicted_origins = testing_probs.idxmax(axis=1)

In [39]:
print(predicted_origins)

0      1
1      1
2      1
3      1
4      1
      ..
113    1
114    1
115    2
116    1
117    3
Length: 118, dtype: int64
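For each row, `idxmax(axis=1)` returns the column label with the largest probability, which we take as the predicted origin. A minimal sketch on a hypothetical probability table (the numbers below are made up, not model output):

```python
import pandas as pd

# Hypothetical probabilities, one column per origin label.
probs = pd.DataFrame({
    1: [0.90, 0.20, 0.30],
    2: [0.05, 0.70, 0.30],
    3: [0.10, 0.15, 0.45],
})

# For each row, pick the column label holding the largest probability.
picked = probs.idxmax(axis=1)
print(picked.tolist())
```

The rows do not need to sum to 1, because each column comes from a separately trained binary model. We only compare the values within a row.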


In [42]:
accuracy_score(test['origin'], predicted_origins)

0.6271186440677966
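Accuracy is the fraction of rows where the predicted label equals the true label. The sketch below uses hypothetical labels, with a shuffled index on the true labels and a default index on the predictions, to suggest that `accuracy_score` compares the two sequences by position rather than by index.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical labels, not taken from the cars data.
true_labels = pd.Series([1, 3, 2, 1], index=[42, 7, 19, 3])  # shuffled index
predicted = pd.Series([1, 3, 1, 1])                          # default 0..3 index

# Library score and a manual position-by-position comparison.
score = accuracy_score(true_labels, predicted)
manual = np.mean(true_labels.to_numpy() == predicted.to_numpy())
print(score, manual)
```

Three of the four positions match, so both values come out to 0.75.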