In [1]:
import pandas as pd
import numpy as np
# Show every column when displaying a DataFrame.
pd.set_option('display.max_columns', None)

In [2]:
cars = pd.read_csv('/storage/emulated/0/DataQuest/Datasets/auto.csv')


In [3]:
cars

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
0,18.0,8,307.0,130.0,3504.0,12.0,70,1
1,15.0,8,350.0,165.0,3693.0,11.5,70,1
2,18.0,8,318.0,150.0,3436.0,11.0,70,1
3,16.0,8,304.0,150.0,3433.0,12.0,70,1
4,17.0,8,302.0,140.0,3449.0,10.5,70,1
...,...,...,...,...,...,...,...,...
387,27.0,4,140.0,86.0,2790.0,15.6,82,1
388,44.0,4,97.0,52.0,2130.0,24.6,82,2
389,32.0,4,135.0,84.0,2295.0,11.6,82,1
390,28.0,4,120.0,79.0,2625.0,18.6,82,1


In [4]:
unique_origins = cars['origin'].unique()

In [5]:
unique_origins

array([1, 3, 2])

In [6]:
cars.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight          float64
acceleration    float64
year              int64
origin            int64
dtype: object

For this dataset, categorical variables exist in three columns: `cylinders`, `year`, and `origin`. The `cylinders` and `year` columns are already stored as integers, but they must be converted to dummy (binary indicator) columns before we can use them as features to predict the label `origin`.

Even though the `year` column is numeric, we're going to treat its values as categories. The year `71` is unlikely to relate to the year `70` the way the two numbers relate numerically; the two are better read as different labels. When a discrete value has no meaningful numeric relationship to the target, treating it as a categorical variable is usually the safer choice.

In [7]:
print(cars.cylinders.value_counts())

4    199
8    103
6     83
3      4
5      3
Name: cylinders, dtype: int64


### Dummy Variables

We must use dummy variables for columns containing categorical values. Whenever we have more than 2 categories, we need to create more columns to represent the categories. Since the `cylinders` column has 5 distinct values (**3**, **4**, **5**, **6**, and **8**), we can split it into 5 separate binary columns, one per value, where a **1** marks the rows that have that cylinder count.
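As a minimal sketch of what that split looks like, the snippet below builds the binary columns by hand for a small made-up sample of cylinder counts. The sample values and the names `toy_cylinders` and `toy_dummies` are illustrative assumptions, not part of the dataset:

```python
import pandas as pd

# Hypothetical sample of cylinder counts, for illustration only.
toy_cylinders = pd.Series([8, 4, 6, 4, 3], name='cylinders')

# One binary column per distinct value: 1 where the row has that value, else 0.
toy_dummies = pd.DataFrame({
    f'cyl_{value}': (toy_cylinders == value).astype(int)
    for value in sorted(toy_cylinders.unique())
})

print(toy_dummies)
```

Each row has exactly one **1** across the new columns, since a car has exactly one cylinder count.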

We can use the `pandas.get_dummies()` function to return a DataFrame containing binary columns built from the values in the `cylinders` column. In addition, if we set the `prefix` parameter to **cyl**, pandas will prepend that prefix to the column names so they match the style we'd like. We then use the `pandas.concat()` function to add the columns from this DataFrame back to `cars`, and do the same for the `year` column:

In [8]:
dummy_cyl = pd.get_dummies(cars['cylinders'], prefix='cyl')
dummy_years = pd.get_dummies(cars['year'], prefix='year')



In [9]:
cars = pd.concat([cars, dummy_cyl], axis=1)
cars = pd.concat([cars, dummy_years], axis=1)
cars.drop(['cylinders', 'year'], axis=1, inplace=True)

cars

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,origin,cyl_3,cyl_4,cyl_5,cyl_6,cyl_8,year_70,year_71,year_72,year_73,year_74,year_75,year_76,year_77,year_78,year_79,year_80,year_81,year_82
0,18.0,307.0,130.0,3504.0,12.0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
1,15.0,350.0,165.0,3693.0,11.5,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
2,18.0,318.0,150.0,3436.0,11.0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
3,16.0,304.0,150.0,3433.0,12.0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
4,17.0,302.0,140.0,3449.0,10.5,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
387,27.0,140.0,86.0,2790.0,15.6,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
388,44.0,97.0,52.0,2130.0,24.6,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
389,32.0,135.0,84.0,2295.0,11.6,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
390,28.0,120.0,79.0,2625.0,18.6,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


When we have 3 or more categories, we call the problem a **multiclass classification** problem. There are a few different methods of doing multiclass classification and in this mission, we'll focus on the **one-versus-all** method.

The **one-versus-all** method is a technique where we choose a single category as the Positive case and group the rest of the categories as the Negative case. We're essentially splitting the problem into multiple binary classification problems. For each observation, each binary model will then output the probability of belonging to its category.
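A rough sketch of that splitting step, using made-up origin labels rather than the real column (the names `toy_origin` and `binary_targets` are hypothetical):

```python
import pandas as pd

# Hypothetical origin labels (1, 2, 3), for illustration only.
toy_origin = pd.Series([1, 3, 2, 1, 1, 3])

# One boolean target per category: True for that category, False for all others.
binary_targets = {
    label: toy_origin == label
    for label in sorted(toy_origin.unique())
}

for label, target in binary_targets.items():
    print(label, target.astype(int).tolist())
```

Each entry in `binary_targets` is the label column for one binary classification problem.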

To start, let's shuffle the rows and split our data into a training set and a test set.

In [10]:
# Shuffle the index labels, then select rows by label in the shuffled order.
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.loc[shuffled_rows]

In [11]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(shuffled_cars, train_size=0.7)


In the one-vs-all approach, we're essentially converting an n-class (in our case n is 3) classification problem into n binary classification problems. For our case, we'll need to train 3 models:

- A model where all cars built in **North America** are considered Positive (1) and those built in **Europe** and **Asia** are considered Negative (0).

- A model where all cars built in **Europe** are considered Positive (1) and those built in **North America** and **Asia** are considered Negative (0).

- A model where all cars built in **Asia** are labeled Positive (1) and those built in **North America** and **Europe** are considered Negative (0).

Each of these models is a **binary classification** model that will return a probability between **0** and **1**. When we apply these models to new data, each model returns a probability value (3 in total). For each observation, we choose the label corresponding to the model that predicted the highest probability.
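The selection rule can be sketched with NumPy on a few made-up probabilities. The array `toy_probs` and its values are assumptions for illustration, not model output:

```python
import numpy as np

# Hypothetical probabilities from three binary models (columns: labels 1, 2, 3).
labels = np.array([1, 2, 3])
toy_probs = np.array([
    [0.90, 0.05, 0.10],
    [0.20, 0.60, 0.30],
    [0.30, 0.25, 0.55],
])

# The three models are fit independently, so a row need not sum to 1;
# only the position of the largest value in each row matters.
predicted = labels[toy_probs.argmax(axis=1)]
print(predicted)
```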

We'll use the dummy variables we created from the cylinders and year columns to train 3 models using the LogisticRegression class from scikit-learn.

In [13]:
unique_origins.sort()
print(unique_origins)

[1 2 3]


In [62]:
from sklearn.linear_model import LogisticRegression

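models = {}
# Use only the dummy columns created from `cylinders` and `year` as features.
features = [var for var in train.columns if
            var.startswith('year') or var.startswith('cyl')]

for origin in unique_origins:
    model = LogisticRegression()

    x_train = train[features]
    # Binary target: True for the current origin, False for every other origin.
    y_train = train['origin'] == origin

    model.fit(x_train, y_train)
    models[origin] = model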


In [15]:
models

{1: LogisticRegression(), 2: LogisticRegression(), 3: LogisticRegression()}

In [74]:
testing_probs = pd.DataFrame(columns=unique_origins)

for origin in unique_origins:
    # Select testing features.
    x_test = test[features]
    # Column 1 of predict_proba is the probability of the positive (True) class,
    # i.e. the probability that the observation belongs to this origin.
    testing_probs[origin] = models[origin].predict_proba(x_test)[:, 1]

In [87]:
testing_probs

Unnamed: 0,1,2,3
0,0.968790,0.020217,0.026215
1,0.938113,0.012847,0.103483
2,0.977194,0.010522,0.032295
3,0.857348,0.065792,0.062360
4,0.512011,0.151580,0.323651
...,...,...,...
113,0.953166,0.040206,0.019381
114,0.926897,0.031458,0.061602
115,0.263314,0.422792,0.307051
116,0.512011,0.151580,0.323651


Now that we have trained the models and computed the probabilities for each origin, we can classify each observation. To do so, we select the origin with the highest predicted probability for that observation.

Since each column in our DataFrame **testing_probs** represents an origin, we just need to choose the one with the largest probability. We can use the DataFrame method `.idxmax()` to return a Series where each value is the label of the column in which the maximum value occurs for that observation. We need to make sure to set the `axis` parameter to **1**, since we want to find the maximum value across columns. Because each column maps directly to an origin, the resulting Series is the classification from our model.

In [86]:
predicted_origins = testing_probs.idxmax(axis=1)

predicted_origins

0      1
1      1
2      1
3      1
4      1
      ..
113    1
114    1
115    2
116    1
117    1
Length: 118, dtype: int64
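
One possible next step, sketched here under assumptions, is to check how often the predicted origin matches the true one. In the notebook above, `predicted_origins` is indexed from 0 while the rows of `test` keep their shuffled index labels, so the sketch compares values by position. The Series `toy_actual` and `toy_predicted` are hypothetical stand-ins for `test['origin']` and `predicted_origins`:

```python
import pandas as pd

# Hypothetical stand-ins: true labels keep a shuffled index (like a test split),
# while predictions are indexed 0..n-1 (like `predicted_origins`).
toy_actual = pd.Series([1, 2, 1, 3], index=[17, 4, 42, 8])
toy_predicted = pd.Series([1, 1, 1, 3])

# Compare by position rather than by index label.
matches = toy_predicted.to_numpy() == toy_actual.to_numpy()
accuracy = matches.mean()
print(accuracy)
```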